Preface



This book is intended for use by scientists, engineers, and students interested in 
sequence learning in artificial intelligence, neural networks, and cognitive science. 
The book will introduce essential algorithms and models of sequence learning 
and develop them in various ways. With the help of these concepts, a variety 
of applications will be examined. This book will allow the reader to acquire an 
appreciation of the breadth and variety within the field of sequence learning and 
its potential as an interesting area of research and application. The reader is 
presumed to have basic knowledge of neural networks and AI concepts. 

Sequential behavior is essential to intelligence and a fundamental part of 
human activities ranging from reasoning to language, and from everyday skills 
to complex problem solving. Sequence learning is an important component of 
learning in many task domains — planning, reasoning, robotics, natural language 
processing, speech recognition, adaptive control, time series prediction, and so 
on. Naturally, there are many different approaches towards sequence learning. 
These approaches deal with different aspects of sequence learning. This book 
provides an overall framework for this field. 
Ron Sun

CECS Department, University of Missouri 
Columbia, MO 65211, USA 



1 Introduction 

Sequential behavior is essential to intelligence, and it is a fundamental part of 
human activities ranging from reasoning to language, and from everyday skills 
to complex problem solving. In particular, sequence learning is an important 
component of learning in many task domains — planning, reasoning, robotics, 
natural language processing, speech recognition, adaptive control, time series 
prediction, financial engineering, DNA sequencing, and so on. 

Naturally, there are many different approaches towards sequence learning, 
resulting from different perspectives taken in different task domains. These ap- 
proaches deal with somewhat differently formulated sequential learning problems 
(for example, some with actions and some without), and/or different aspects of 
sequence learning (for example, sequence prediction vs. sequence recognition). 

Sequence learning is clearly a difficult task. More powerful algorithms for 
sequence learning are needed in all of these aforementioned domains. It is our
view that the right approach to develop better techniques, algorithms, models, 
and theories is to first better understand the state of the art in different dis- 
ciplines related to this topic. There seems to be a need to compare, contrast, 
and combine different existing techniques, approaches, and paradigms, in order 
to develop better and more powerful algorithms. Currently, existing techniques 
and algorithms include recurrent neural networks, hidden Markov models, dy- 
namic programming, reinforcement learning, graph theoretical models, search 
based models, evolutionary computational models, symbolic planning models, 
production rule based models, and so on. 

Especially important to this topic area is the fact that work on sequence 
learning has been going on in several different disciplines such as artificial in- 
telligence, neural networks, cognitive science (human sequence learning, e.g., 
in skill acquisition), and engineering. We need to examine the field in a cross- 
disciplinary way and take into consideration all of these different perspectives 
on sequence learning. Thus, we need interdisciplinary gatherings that include re- 
searchers from all of these orientations and disciplines, beyond narrowly focused 
meetings on specialized topics such as reinforcement learning or recurrent neural 
networks. 

A workshop on neural, symbolic, and reinforcement methods for sequence
learning (co-chaired by Ron Sun and Lee Giles) was held on August 1st, 1999,
preceding IJCAI'99, in Stockholm, Sweden. The following issues concerning
sequence learning were raised and addressed:

(1) Underlying similarities and differences among different models, including
(1.1) problem formulation (ontological issues), (1.2) mathematical comparisons, 
(1.3) task appropriateness, and (1.4) performance analysis and bounds. 

(2) New and old models, and their capabilities and limitations, including (2.1) 
theory, (2.2) implementation, (2.3) performance, and (2.4) empirical comparisons 
in various domains. 

(3) Hybrid models, including (3.1) foundations for synthesis or hybridization, 
and (3.2) advantages, problems, and issues of hybrid models. 

(4) Successful sequence learning applications and future extensions, includ- 
ing (4.1) examples of successful applications, (4.2) generalization and transfer 
of successful applications, and (4.3) what is needed for further enhancing the 
performance of these successful applications. 

In this book, we will be mostly concerned with items 1, 2 and 4, focusing 
on exploring similarities, differences, capabilities, and limitations of existing and 
new models, in relation to various task domains and various sequence learning 
problems. 



2 Problem Formulations and Their Relationships 

One aim of the present volume is to better understand different formulations of 
sequence learning problems. 

With some necessary simplification, we can categorize various sequence learn- 
ing problems that have been tackled into the following categories: (1) sequence 
prediction, in which we want to predict elements of a sequence based on the 
preceding element(s); (2) sequence generation, in which we want to generate 
elements of a sequence one by one in their natural order; and (3) sequence 
recognition, in which we want to determine if a sequence is a legitimate one 
according to some criteria; in addition, (4) sequential decision making involves 
selecting a sequence of actions, to accomplish a goal, to follow a trajectory, or 
to maximize (or minimize) a reinforcement (or cost) function that is normally 
the (discounted) sum of reinforcements (costs) that are received along the way 
(see Bellman 1957, Bertsekas and Tsitsiklis 1995). 

These different sequence learning problems can be more precisely formulated 
as follows (assume a deterministic world for now): 

— Sequence prediction: $s_i, s_{i+1}, \ldots, s_j \rightarrow s_{j+1}$, where $1 \le i \le j < \infty$; that
is, given $s_i, s_{i+1}, \ldots, s_j$, we want to predict $s_{j+1}$. When $i = 1$, we make
predictions based on all of the previously seen elements of the sequence.
When $i = j$, we make predictions based only on the immediately preceding
element.

— Sequence generation: $s_i, s_{i+1}, \ldots, s_j \rightarrow s_{j+1}$, where $1 \le i \le j < \infty$; that is,
given $s_i, s_{i+1}, \ldots, s_j$, we want to generate $s_{j+1}$. (Put in this way, it is clear
that sequence prediction and generation are essentially the same task.)

— Sequence recognition: $s_i, s_{i+1}, \ldots, s_j \rightarrow$ yes or no, where $1 \le i \le j < \infty$;
that is, given $s_i, s_{i+1}, \ldots, s_j$, we want to determine if this subsequence is
legitimate or not. (There are alternative ways of formulating the sequence
recognition problem, for example, as a one-shot recognition process, as
opposed to an incremental step-by-step recognition process as formulated
here.)

With this formulation, sequence recognition can be turned into sequence
generation/prediction, by basing recognition on prediction (see the chapter
by D. Wang in this volume); that is, $s_i, s_{i+1}, \ldots, s_j \rightarrow$ yes (a recognition
problem), if and only if $s_i, s_{i+1}, \ldots, s_{j-1} \rightarrow s_j^p$ (a prediction problem) and
$s_j^p = s_j^a$, where $s_j^p$ is the prediction and $s_j^a$ is the actual element.

Sequence learning (either generation, prediction, or recognition) is usually 
based on models of legitimate sequences, which can be developed through 
training with exemplars. Models may be in the form of Markov chains, hidden 
Markov models, recurrent neural networks, and a variety of other forms. 
Expectation-maximization, gradient descent, or clustering may be used in 
training (see e.g. the chapters by Oates and Cohen and by Sebastiani et al 
in this volume). Such training may extract “central tendencies” from a set 
of exemplars. 

— Sequential decision making (that is, sequence generation through actions):
there are several possible variations. In the goal oriented case, we have
$s_i, s_{i+1}, \ldots, s_j; s_G \rightarrow a_j$, where $1 \le i \le j < \infty$; that is, given $s_i, s_{i+1}, \ldots, s_j$
and the goal state $s_G$, we want to choose an action $a_j$ at time step $j$
that will likely lead to $s_G$ in the future. In the trajectory oriented case,
we have $s_i, s_{i+1}, \ldots, s_j; s_{j+1} \rightarrow a_j$, where $1 \le i \le j < \infty$; that is, given
$s_i, s_{i+1}, \ldots, s_j$ and the desired next state $s_{j+1}$, we want to choose an action
$a_j$ at time step $j$ that will likely lead to $s_{j+1}$ in the next step. In the reinforcement
maximizing case, we have $s_i, s_{i+1}, \ldots, s_j \rightarrow a_j$, where $1 \le i \le j < \infty$;
that is, given $s_i, s_{i+1}, \ldots, s_j$, we want to choose an action $a_j$ at time step
$j$ that will likely lead to receiving maximum total reinforcement in the future.
The calculation of total reinforcement can be in terms of discounted or
undiscounted cumulative reinforcement, in terms of average reinforcement,
or in terms of some other functions of reinforcement (Bertsekas and Tsitsiklis
1996, Kaelbling et al 1996, Sutton and Barto 1997).

The action selection can be based on the immediately preceding element
(i.e., $i = j$), in which case a Markovian action policy is in force; otherwise, a
non-Markovian action policy is involved (McCallum 1996, Sun and Sessions
2000). Yet another possibility is that the action decisions are not based on
preceding elements at all (i.e., $s_i, s_{i+1}, \ldots, s_j$ are irrelevant), in which case
an open-loop policy is used (Sun and Sessions 1998), while in the previous
two cases, closed-loop policies are involved, as will be discussed later.

Note that, on this view, sequence generation/prediction can be viewed as a 
special case of sequential decision making. 
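
As a rough illustration of how recognition can be based on prediction, as described in the list above, the following Python sketch learns a first-order predictor (the case $i = j$) from exemplars and accepts a sequence only if every element matches its prediction. The function names, the lookup-table predictor, and the toy training sequences are assumptions made here for illustration; they are not taken from any chapter in this volume.

```python
from collections import Counter, defaultdict

def train_predictor(sequences):
    """Learn a first-order predictor: the most frequent successor of each element."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return {prev: c.most_common(1)[0][0] for prev, c in counts.items()}

def predict(predictor, history):
    """Sequence prediction with i = j: use only the immediately preceding element."""
    return predictor.get(history[-1])

def recognize(predictor, seq):
    """Incremental recognition: accept only if each element equals its prediction."""
    return all(predict(predictor, seq[:t]) == seq[t] for t in range(1, len(seq)))

# Hypothetical exemplars of "legitimate" sequences
predictor = train_predictor([list("abcabc"), list("abcab")])
```

With these exemplars, `recognize(predictor, list("abca"))` accepts the sequence, while `recognize(predictor, list("abba"))` rejects it at the first element that contradicts the prediction.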

The above exposition reveals the relationship among different categories of
sequence learning, under the assumption of a deterministic world. Another
assumption in the above discussion is that we considered only closed-loop
situations: that is, we deal only with one step beyond what is known or done. For
open-loop prediction, generation, or action, we can have the following
formalizations:



— Sequence prediction: $s_i, s_{i+1}, \ldots, s_j \rightarrow s_{j+1}, \ldots, s_{j+k}$, where $1 \le i \le j < \infty$
and $k > 1$; that is, given $s_i, s_{i+1}, \ldots, s_j$, we want to be able to predict
$s_{j+1}, \ldots, s_{j+k}$.

— Sequence generation: $s_i, s_{i+1}, \ldots, s_j \rightarrow s_{j+1}, \ldots, s_{j+k}$, where $1 \le i \le j < \infty$
and $k > 1$.

— Sequential decision making: $s_i, s_{i+1}, \ldots, s_j; G \rightarrow a_j, \ldots, a_{j+k}$, where
$1 \le i \le j < \infty$ and $k > 1$; that is, given $s_i, s_{i+1}, \ldots, s_j$ and a goal state, a goal
trajectory, or a reinforcement maximizing goal, we want to choose actions
$a_j$ through $a_{j+k}$ that may lead to that goal.



The formulation of sequence recognition (either incremental or one-shot) is not 
changed, as sequence recognition has nothing to do with being either closed-loop 
or open-loop. 
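
One possible way to obtain an open-loop prediction, sketched below in Python, is to iterate a one-step predictor and feed each prediction back in as if it had been observed. This is only one construction under that assumption; the names `predict_open_loop` and `cyclic_step` are hypothetical, and a model could equally well produce all $k$ elements at once.

```python
def predict_open_loop(step, history, k):
    """Predict the next k elements without observing any of them."""
    extended = list(history)
    for _ in range(k):
        # Each prediction is appended and treated as known for the next step
        extended.append(step(extended))
    return extended[len(history):]

def cyclic_step(history):
    """Hypothetical deterministic one-step predictor over the cycle a, b, c."""
    order = "abc"
    return order[(order.index(history[-1]) + 1) % len(order)]

future = predict_open_loop(cyclic_step, list("ab"), 4)
```

Setting `k = 1` recovers the closed-loop case, in which only one step beyond what is known is predicted.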

To extend our discussion from a deterministic world to a stochastic world,
we may specify a probability distribution in place of a deterministic
prediction/generation as adopted above. Predictions and generations can then be
probabilistic: a distribution is calculated concerning an element in a sequence:
$p(s_{j+1} \mid s_i, \ldots, s_j)$. Recognition may also be probabilistic: a probability $p(s_{j+1})$ may
be calculated so as to determine how likely a sequence is a legitimate one. In
cases where action is involved, the transition to a new element (i.e., a new state)
after an action is performed can be stochastic (determined by a probability
distribution): $p_{a_j}(s_j, s_{j+1})$, where action $a_j$ is performed in state $s_j$ at step $j$, and
$s_{j+1}$ is a new state resulting from the action. The action selection in such cases
can also be probabilistic, whereby an action is selected from an action probability
distribution $p(s_j, a_j)$, where $s_j$ is a state and $a_j$ is an action. In such cases,
we may find the best stochastic action policy, i.e., the optimal action probability
distribution in each state.
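
To make the stochastic case concrete, the following Python sketch samples a trajectory in which both the action selection and the state transition are drawn from probability distributions. The two-state world, the tables `policy` and `transition`, and all probability values are invented for illustration.

```python
import random

def sample(dist, rng):
    """Draw one outcome from a dict mapping outcomes to probabilities."""
    outcomes = list(dist)
    return rng.choices(outcomes, weights=[dist[o] for o in outcomes], k=1)[0]

# Hypothetical world: policy[s] is the action probability distribution in state s,
# transition[(s, a)] is the distribution over next states after action a in state s.
policy = {"s0": {"stay": 0.2, "go": 0.8}, "s1": {"stay": 1.0}}
transition = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"): {"s0": 0.1, "s1": 0.9},
    ("s1", "stay"): {"s1": 1.0},
}

def rollout(start, steps, rng):
    """Generate a state sequence under the stochastic policy and transitions."""
    state, trace = start, [start]
    for _ in range(steps):
        action = sample(policy[state], rng)
        state = sample(transition[(state, action)], rng)
        trace.append(state)
    return trace

trace = rollout("s0", 20, random.Random(0))
```

Finding the best stochastic action policy would then amount to adjusting the entries of `policy`, which this sketch does not attempt.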

In addition to addressing these types of tasks, there are also other models that 
address additional issues arising out of, or along with, these above categories of 
sequence learning tasks. For example, we may want to segment a sequence (i.e., 
perform sequential segmentation) for the sake of compressing the description of 
sequences (Nevill-Manning and Witten 1997), or for the sake of better dealing 
with temporal dependencies (Sun and Sessions 2000, McCallum 1996). We may 
even form hierarchies of sequential segmentation (see the chapter by Sun and
Sessions in this volume). We may also create modular structures during sequence
learning (i.e., perform modularization) to facilitate learning processes, such as in
Singh (1994), Tham (1995), Thrun and Schwartz (1995), and Sun and Peterson
(1999). In addition, sequence filtering is also a common task.

3 Characterizing Sequence Learning Models

There are many existing models for different types of sequence learning discussed 
above. They vary greatly in terms of applicable domain, emphasis, or learning 
algorithm. We can characterize them along several major dimensions as follows: 

— learning paradigm (whether supervised, unsupervised, reinforcement based, 
or knowledge based learning is used) 

— implementation paradigm (whether neural networks, symbolic rules, or lookup 
tables are used for implementation) 

— whether the world is probabilistic or deterministic (in terms of the determi- 
nation of next elements) 

— whether the world is Markovian or non-Markovian 

— whether the task is closed-loop or open-loop 

— whether action is involved or not 

— whether an action policy is deterministic or stochastic (when action is se- 
lected) 

— applicable domains, e.g., market agents (see the chapter by Tesauro),
speech and language processing (e.g., Wermter et al 1996), navigation learning
(e.g., Sun and Peterson 1998, 1999), or motor sequence learning (see the
chapters by Bapi and Doya and by Grossberg and Paine).

Let us briefly review and compare several major approaches to sequence
learning, such as recurrent neural networks, symbolic planning, dynamic
programming (value iteration or policy iteration), and reinforcement learning (such
as Q-learning), while keeping in mind these aforementioned dimensions.

First of all, there are a number of neural network models for dealing with 
sequences, such as recurrent backpropagation networks (see the review chapters 
by Sona and Sperduti and by Chappelier et al). Such networks typically use 
hidden nodes as memory, which represents a condensed record of the previously 
encountered subsequence. The hidden nodes are connected back to the input
nodes and are used in subsequent time steps, so that the previous states are taken
into account. Recurrent backpropagation networks are widely used for sequence
prediction and generation (Frasconi et al 1995, Giles et al 1995). A potential
shortcoming of such networks is that they require supervised learning and therefore
may not be suitable for typical sequential decision making learning scenarios
(whereby there is no teacher input, besides sparse reinforcement signals). There
have also been associative networks proposed for sequence learning (see, e.g., 
Guyon et al 1988, Kleinfeld and Sompolinsky 1988). In this volume, Baldi et 
al, Sona and Sperduti, and Chappelier et al discussed various forms of recurrent 
backpropagation networks. Wang discussed more structured networks of his own 
design for sequence learning. 
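
The role of hidden nodes as memory can be sketched with a minimal recurrent forward pass in Python. This is an illustrative, untrained network in the style of a simple recurrent (Elman-type) architecture; the weight matrices `W_in`, `W_rec`, and `W_out` are hand-picked assumptions, and no backpropagation training is shown.

```python
import math

def rnn_step(x, h, W_in, W_rec, W_out):
    """One time step: the new hidden state combines the input with the previous hidden state."""
    h_new = [
        math.tanh(sum(w * xi for w, xi in zip(W_in[k], x))
                  + sum(w * hi for w, hi in zip(W_rec[k], h)))
        for k in range(len(h))
    ]
    y = [sum(w * hk for w, hk in zip(row, h_new)) for row in W_out]
    return y, h_new

def run(sequence, W_in, W_rec, W_out):
    """Process a sequence element by element, carrying the hidden state forward."""
    h = [0.0] * len(W_rec)  # empty memory before the first element
    outputs = []
    for x in sequence:
        y, h = rnn_step(x, h, W_in, W_rec, W_out)
        outputs.append(y)
    return outputs, h

# Hand-picked weights: hidden unit 0 encodes the current input,
# hidden unit 1 copies what unit 0 held at the previous step.
W_in = [[1.0], [0.0]]
W_rec = [[0.0, 0.0], [1.0, 0.0]]
W_out = [[0.0, 1.0]]

out_a, _ = run([[1.0], [0.0]], W_in, W_rec, W_out)
out_b, _ = run([[0.0], [0.0]], W_in, W_rec, W_out)
```

Both sequences end with the same input, yet the final outputs differ because the hidden state carries a condensed record of the preceding element.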

A good solution for learning sequential decision making, as well as sequence
prediction/generation, is the temporal difference method, as described by e.g.
Sutton and Barto (1997) and Watkins (1989). In general, there is an evaluation
function and/or an action policy. The evaluation function generates a value,
$e(x)$, for a current state (input) $x$, which measures the goodness of $x$. An action
$a$ is chosen according to a certain action policy, based on the evaluation function
$e(x)$. An action performed leads to a new state $y$ and a reinforcement signal
$r$. We then modify (the parameters of) the evaluation function so that $e(x)$ is
closer to $r + \gamma e(y)$, where $0 < \gamma < 1$ is a discount factor. At the same time,
the action policy may also be updated, to strengthen or weaken the tendency
to perform the chosen action $a$, according to the error in evaluating the state:
$r + \gamma e(y) - e(x)$. That is, if the situation is getting better because of the action,
increase the tendency to perform that action; otherwise, reduce the tendency.
The learning process is dependent upon the temporal difference in evaluating
each state. Q-learning (Watkins 1989) is a variation of this method, in which
the policy and the evaluation function are merged into one function $Q(x, a)$,
where $x$ is the current state and $a$ is an action. Much existing work shows that
Q-learning is as good as any other reinforcement learning algorithm (Lin 1992,
Tesauro 1992). However, reinforcement learning is problematic when the input
state space is continuous or otherwise large, in which case reinforcement learning
using table lookup is no longer applicable, and connectionist implementations
are not guaranteed to learn successfully (Lin 1992, Tesauro 1992). In this volume,
the chapters by Schmidhuber, by Sun and Sessions, by Choi et al, and by Tesauro
are concerned with various extensions of reinforcement learning, for dealing with
non-Markovian dependencies, co-learning in multi-agent systems, and so on.
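
The temporal difference idea can be illustrated with a small table-lookup Q-learning sketch in Python. The learning rate `alpha`, the epsilon-greedy action selection with random tie-breaking, and the four-cell `corridor` task are assumptions introduced here for illustration; only the update toward the reinforcement plus the discounted value of the next state follows the description above.

```python
import random
from collections import defaultdict

def q_learning(step, actions, start, episodes, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning; Q[(x, a)] merges evaluation function and policy."""
    rng = random.Random(seed)
    Q = defaultdict(float)
    for _ in range(episodes):
        x, done = start, False
        while not done:
            if rng.random() < epsilon:
                a = rng.choice(actions)  # occasional exploration
            else:
                # Greedy choice, with ties broken at random
                a = max(actions, key=lambda b: (Q[(x, b)], rng.random()))
            y, r, done = step(x, a)
            target = r if done else r + gamma * max(Q[(y, b)] for b in actions)
            Q[(x, a)] += alpha * (target - Q[(x, a)])  # temporal difference update
            x = y
    return Q

def corridor(x, a):
    """Hypothetical 4-cell corridor: reinforcement 1 on reaching cell 3, else 0."""
    y = max(0, min(3, x + (1 if a == "right" else -1)))
    return y, (1.0 if y == 3 else 0.0), y == 3

Q = q_learning(corridor, ["left", "right"], start=0, episodes=200)
```

After training, moving right is valued above moving left in every non-terminal cell. A table such as `Q` is feasible only because the state space here is tiny, which is exactly the limitation noted above for large or continuous state spaces.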

Another approach for sequential decision making is explicit symbolic plan- 
ning as developed in artificial intelligence. From an overall goal, a number of 
subgoals are generated that are to be accomplished in a certain order; then, 
from each of these subgoals, a number of subsubgoals are generated, and so on, 
until each of these goals can be accomplished in one step. Subgoals are generated 
based on domain knowledge. Alternatively, partial ordering of primitive actions 
can be incrementally established from known domain knowledge, and then a to- 
tal ordering of actions is produced based on partial ordering, which constitutes 
a plan for action sequencing. The shortcomings of the planning approach are 
the following: (1) a substantial amount of prior knowledge is required in this 
approach, which is not always readily available; (2) a high computational com- 
plexity is inherent in this approach; (3) symbolic planning may be unnatural 
for describing some simple reactive sequential behavior, as pointed out by
advocates of situated action (e.g., Agre and Chapman 1990). Recently, Inductive
Logic Programming techniques (Lavrac and Dzeroski 1994) have been developed,
which can be used to learn symbolic knowledge from exemplars and thus
partially remedy one problem of symbolic planning.
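
The step from a partial ordering of primitive actions to a total ordering can be sketched as a topological sort. The Python example below uses the standard-library `graphlib` module (Python 3.9 or later); the tea-making actions and their ordering constraints are hypothetical, and a real planner would derive such constraints from domain knowledge rather than list them by hand.

```python
from graphlib import TopologicalSorter

# Hypothetical ordering constraints: each action maps to the set of
# actions that must be completed before it.
precedes = {
    "boil water": {"fill kettle"},
    "add tea": {"get cup"},
    "pour water": {"boil water", "add tea"},
}

def linearize(constraints):
    """Return one total ordering of actions consistent with the partial ordering."""
    return list(TopologicalSorter(constraints).static_order())

plan = linearize(precedes)
```

Several total orderings are usually consistent with a given partial ordering; `static_order` returns one of them.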

Hidden Markov models (HMM) can be used for sequence learning (including 
generation and recognition; Baum 1972). This type of model learns underlying 
state transition probability distributions from which overt observation data are 
generated. Sequence generation can evidently be handled by this type of model. 
Sequence recognition can also be accomplished based on hidden Markov models 
by computing probabilities of generating sequences from underlying models. In 
this volume, a number of chapters deal with extending and enhancing known 
HMM algorithms, by considering, e.g., clustering of HMMs (as in Oates and
Cohen), multiple HMMs with transitions among them (as in Choi et al), or 
incorporating input/output relations and bidirectional (forward and backward) 
predictions (as in Baldi et al). 
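
Recognition by computing the probability of generating a sequence from an underlying model can be sketched with the standard forward algorithm. In the Python example below, the two toy models (`alternating` and `sticky`), their shared emission table, and all probability values are assumptions made for illustration; the parameters are fixed by hand rather than learned.

```python
def sequence_likelihood(obs, init, trans, emit):
    """Forward algorithm: probability that the HMM generates the observation sequence."""
    states = list(init)
    alpha = {s: init[s] * emit[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {
            s: emit[s][o] * sum(alpha[r] * trans[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())

# Hypothetical models sharing hidden states A and B and observations a and b
emit = {"A": {"a": 0.9, "b": 0.1}, "B": {"a": 0.1, "b": 0.9}}
init = {"A": 0.5, "B": 0.5}
alternating = {"A": {"A": 0.1, "B": 0.9}, "B": {"A": 0.9, "B": 0.1}}
sticky = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.1, "B": 0.9}}

def recognize(obs):
    """Pick the model under which the observed sequence is more probable."""
    scores = {
        "alternating": sequence_likelihood(obs, init, alternating, emit),
        "sticky": sequence_likelihood(obs, init, sticky, emit),
    }
    return max(scores, key=scores.get)
```

Here `recognize("abab")` selects the alternating model and `recognize("aaaa")` selects the sticky one.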

Another approach suitable for a variety of types of sequence learning is
evolutionary computation (EC). EC is a “weak” method for knowledge-lean heuristic
search. It updates its knowledge based (mostly) on the acquired experience of an 
entire generation (each member of which goes through many training trials) at 
the end of all their trials. It can be used to tackle all types of sequence learning 
because of the generality of this learning method (see, e.g., Grefenstette 1992, 
Meeden 1995, as well as the chapter by Schmidhuber in this volume). 
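
A minimal generational loop, sketched in Python below, shows the sense in which knowledge is updated only after an entire generation has been evaluated. The truncation selection, point mutation, elitism, and the `steps_right` fitness function are simplifying assumptions chosen for brevity; they do not correspond to any particular EC system cited here.

```python
import random

def evolve(fitness, length, alphabet, generations=30, pop_size=20, seed=0):
    """Evolve fixed-length action sequences; returns the best one and the fitness history."""
    rng = random.Random(seed)
    population = [[rng.choice(alphabet) for _ in range(length)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        # Every member is evaluated first; the population is updated only afterwards
        ranked = sorted(population, key=fitness, reverse=True)
        history.append(fitness(ranked[0]))
        parents = ranked[: pop_size // 2]
        children = []
        for _ in range(pop_size - len(parents)):
            child = list(rng.choice(parents))
            child[rng.randrange(length)] = rng.choice(alphabet)  # point mutation
            children.append(child)
        population = parents + children  # elitism: the best members survive
    best = max(population, key=fitness)
    return best, history + [fitness(best)]

def steps_right(actions):
    """Hypothetical fitness: final position on a line after moving left (L) or right (R)."""
    return sum(1 if a == "R" else -1 for a in actions)

best, history = evolve(steps_right, 8, "LR")
```

Because the best members are carried over unchanged, the recorded best fitness never decreases from one generation to the next.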

There are also other ways in which sequence learning can be performed, 
extensions and variations of existing methods can be explored, and advances 
beyond existing work can be made (see, e.g., Wang and Alkon 1993, Giles and 
Gori 1998). 

Given the variety of different approaches for sequence learning, we may com- 
pare, and combine, some of these approaches. In this volume, the chapter by Sona 
and Sperduti and, to a lesser extent, the chapter by Chappelier et al compare
(informally) a variety of different models for sequence learning, and illustrate
their respective strengths and weaknesses. Clearly, these different techniques are
suitable for somewhat different learning situations, and a clear and formal
understanding of this aspect is yet to be achieved. In terms of combining different
techniques to form some sorts of hybrid models, there are clearly many possi- 
bilities, for example, combining symbolic rules and neural networks for imple- 
menting reinforcement learning (Sun and Peterson 1998), or combining symbolic 
planning and reinforcement learning to produce action sequences (Dearden and 
Boutilier 1997, Sun and Sessions 1998). There are also proposals for combin- 
ing EC and recurrent neural networks for sequence learning. Hybridization is 
definitely an issue worthy of further consideration. 

4 Applications 

A number of applications of sequence learning are reported in this volume. For 
example, Baldi et al discussed protein secondary structure prediction. Sebastiani
et al focused on robot sensory processing. Bapi and Doya analyzed motor
sequence learning. Tesauro explored automatic pricing in e-commerce. Zaki dealt
with data mining in large databases of sequential data.

There are, of course, many other areas of applications for sequence learning 
that are not covered by the present volume due to space limitations. In fact,
sequence learning is arguably the most prevalent form of human and animal 
learning. In classical studies of instrumental conditioning, sequencing is central 
(Anderson 1995). In human skill learning, sequentiality is often the most im- 
portant aspect (Willingham et al 1989, Sun et al 2000; see also the chapter by 
Lebiere and Wallach in this volume). In human high-level problem solving and 
reasoning, sequences are also essential (Anderson 1995). Therefore, when building
intelligent systems to emulate human intelligence and cognition, we must
pay serious attention to sequences, including sequence learning.



5 Further Issues 

One difficult issue in sequence learning is temporal (non-Markovian) dependen- 
cies, especially long-range dependencies. A current situation may be dependent 
on what happened before, and sometimes on what happened a long time ago. 
Many existing models have difficulties dealing with such temporal dependencies.
For example, it is well known that many recurrent neural network models
cannot deal well with long-range dependencies (see the chapter by Chappelier 
et al). Dealing with long-range dependencies is even harder for reinforcement 
learning. Many heuristic methods (e.g., McCallum 1996, Sun and Sessions 2000) 
may help to facilitate learning of temporal dependencies to some extent, but 
they break down in cases of long-range dependencies. The methods described in 
the chapters by Zaki and by Schmidhuber may have some potential in dealing 
with this issue. 

Another issue in sequence learning is hierarchical structuring of sequences. 
Many real-world problems give rise to sequences that have clear hierarchical 
structures: a sequence is made up of subsequences and they in turn are made 
up of subsubsequences, and so on. The difficulty lies in how we automatically 
identify these subsequences and deal with them accordingly. This issue is some- 
what related to the issue of temporal dependencies, as hierarchical structures are 
often determined based on temporal dependencies. Learning hierarchical struc- 
tures may help to reduce or even eliminate temporal dependencies. It may also 
help to compress the description of sequences. For learning hierarchical sequences 
with no action involved, there are some existing methods for segmentation (e.g., 
Nevill-Manning and Witten 1997). For learning hierarchical action sequences, the 
reinforcement learning methods developed in the chapter by Sun and Sessions 
showed some potential. 

Yet another issue concerns complex or chaotic sequences often encountered in 
real-world applications such as time series prediction or neurobiological modeling 
(Wang and Alkon 1993). There has been a substantial amount of work in this 
area. However, not enough understanding has been achieved. Much more work is 
needed and further progress is expected in this area. The same can be said about 
hybridization of existing sequence learning techniques as mentioned earlier. 



6 Concluding Remarks 

The importance of sequential behavior and sequence learning in intelligence and
cognition cannot be overestimated. A mistake of earlier work in AI and machine
learning was to downplay the role of sequential behavior and sequence
learning (which has been corrected to a large extent recently). We hope that
this book serves to emphasize, and bring to the attention of the scientific and
engineering communities, the importance of this problem. In this introduction,
I have tried to present a unified view of a variety of sequence learning formulations,
paradigms, and approaches, to give some coherence to this topic. This book, I
hope, is the first step toward establishing a more unified, more coherent
conceptual framework for the study of sequence learning of various types.

1 Introduction

Suppose one has a set of univariate time series generated by one or more unknown 
processes. The problem we wish to solve is to discover the most probable set of 
processes generating the data by clustering time series into groups so that the 
elements of each group have similar dynamics. For example, if a batch of time 
series represents sensory experiences of a mobile robot, clustering by dynamics 
might find clusters corresponding to abstractions of sensory inputs (Ramoni, 
Sebastiani, Cohen, Warwick, & Davis, 1999). 

The method presented in this chapter transforms each time series into a 
Markov Chain (MC) and then clusters time series generated by the same MCs. A
MC represents a dynamic process as a transition probability matrix. If we regard 
each time series as being generated by a stochastic variable, we can construct a 
transition probability matrix for each observed time series. Each row and column 
in the matrix represents a state of the stochastic variable, and the cell values are 
the probabilities of transition from one state to each other state of this variable 
in the next time step. A transition probability matrix is learned for each time 
series in a training batch of time series. Next, a Bayesian clustering algorithm 
groups time series that produce similar transition probability matrices. The task 
of the clustering algorithm is two-fold: to find the set of clusters that gives the 
best partition of time series according to some measure, and to assign each time 
series to one cluster. 

Clustering can be simply a matter of grouping objects together so that the
average similarity of a pair of objects is high when they are in the same
group and low when they are in different groups. Numerous clustering algorithms
have been developed along this principle (Fisher, 2000). Our algorithm, called
Bayesian Clustering by Dynamics (BCD), does not use a measure of similarity
between MCs to decide whether two time series belong to the same cluster. It
uses a different principle: both the decision of whether to group time series and
the stopping criterion are based on the posterior probability of the clustering,
that is, the probability of the clustering conditional on the data observed. In
other words, two time series are assigned to the same cluster if this operation
increases the posterior probability of the clustering, and the algorithm stops when
the posterior probability of the clustering is maximum. Said in yet another way,
BCD solves a Bayesian model selection problem, where the model it seeks is the
most probable partition of time series given the data. To increase efficiency, the
algorithm uses an entropy-based heuristic and performs a hill-climbing search
through the space of possible partitions, so it yields a local-maximum posterior
probability clustering.

The algorithm produces a set of clusters, where each cluster is identified by
a MC estimated from the time series grouped in the cluster. Given the model-based
probabilistic nature of the algorithm, clusters induced from the batch of
time series have a probability distribution which can be used for reasoning and
prediction so that, for example, one can detect cluster membership of a new time
series or forecast future values. Although a MC is a very simple description of a
dynamic process, BCD has been applied successfully to cluster robot experiences
based on sensory inputs (Ramoni et al., 1999), simulated war games (Sebastiani,
Ramoni, Cohen, Warwick, & Davis, 1999), as well as the behavior of stocks
in the market and automated learning and generation of Bach’s counterpoint. One
conjecture about the success of the algorithm is that describing a dynamic process
as a MC can be enough to capture the common dynamics of different time series
without resorting to more complex models such as, for example, Hidden Markov Models
(Rabiner, 1989).

The remainder of this chapter is organized as follows. We describe BCD in
Section 2. Section 3 shows how to perform classification and prediction with clusters
of dynamics. An application of the algorithm to clustering robot sensory inputs is
presented in Section 4. Section 5 describes related and future work, and conclusions
are in the last section of this chapter.



2 Bayesian Clustering by Dynamics 

Suppose we are given a batch of m time series that record the values 1, 2, ..., s
of a variable X. Consider, for example, the plot of three time series in Figure 1. 
Each time series records the values of a variable with five states — labeled 1 to 5 
— in 50 time steps. It is not obvious that the three time series are observations 
of the same process. However, when we explore the underlying dynamics of 
the three series more closely, we find, for example, that state 2 is frequently 
followed by state 1, and state 3 is followed disproportionately often by state 
1. We are interested in extracting these types of similarities among time series 
to identify time series that exhibit similar dynamics. To cluster time series by 
their dynamics, we model time series as MCs. For each time series, we estimate 
a transition matrix from data and then we cluster similar transition matrices. 
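
As a small illustration of estimating a transition matrix from one time series, the following Python sketch counts observed transitions and normalizes each row, which corresponds to the frequency-based estimate discussed in Section 2.1. The function name, the example series, and the choice to leave an unvisited state's row at zero are assumptions made here; the clustering step itself is not shown.

```python
def estimate_transition_matrix(series, n_states):
    """Estimate p_ij as n_ij / n_i from one time series with states 1..n_states."""
    counts = [[0] * n_states for _ in range(n_states)]
    for prev, nxt in zip(series, series[1:]):
        counts[prev - 1][nxt - 1] += 1  # n_ij: observed transitions i -> j
    matrix = []
    for row in counts:
        total = sum(row)  # n_i: all observed transitions out of state i
        # A state with no observed outgoing transitions keeps a row of zeros
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix

# Hypothetical series over three states
P_hat = estimate_transition_matrix([1, 2, 1, 2, 3, 1], 3)
```

For this series, state 1 is always followed by state 2, state 2 is followed equally often by states 1 and 3, and state 3 is always followed by state 1.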



2.1 Learning Markov Chains 

Suppose we observe a time series $S = (x_0, x_1, x_2, \ldots, x_{t-1}, x_t, \ldots)$, where each $x_t$ is
one of the states $1, \ldots, s$ of a variable $X$. The process generating the sequence $S$
is a first order MC if the conditional probability that the variable $X$ visits state $j$
at time $t$, given the sequence $(x_0, x_1, x_2, \ldots, x_{t-1})$, is only a function of the state
visited at time $t-1$ (Ross, 1996). Hence, we write

$$p(X_t = j \mid (x_0, x_1, x_2, \ldots, x_{t-1})) = p(X_t = j \mid x_{t-1}),$$

with $X_t$ denoting the variable $X$ at time $t$. In other words, the probability
distribution of the variable $X$ at time $t$ is conditionally independent of the values
$(x_0, x_1, x_2, \ldots, x_{t-2})$, once we know $x_{t-1}$. This conditional independence
assumption allows us to represent a MC as a vector of probabilities
$p_0 = (p_{01}, p_{02}, \ldots, p_{0s})$, denoting the distribution of $X_0$ (the initial state of
the chain), and a matrix $P$ of transition probabilities, where
$p_{ij} = p(X_t = j \mid X_{t-1} = i)$, so that

$$P = (p_{ij}) = \begin{pmatrix} p_{11} & p_{12} & \cdots & p_{1s} \\ p_{21} & p_{22} & \cdots & p_{2s} \\ \vdots & \vdots & & \vdots \\ p_{s1} & p_{s2} & \cdots & p_{ss} \end{pmatrix}.$$

Fig. 1. Plot of three time series.


Given a time series generated by a MC, we wish to estimate the probabilities $p_{ij}$
of state transitions $(i \to j) = (X_{t-1} = i \to X_t = j)$ from the data. This is a well
known statistical estimation problem whose solution is to estimate $p_{ij}$ as $n_{ij}/n_i$,
where $n_i = \sum_j n_{ij}$ and $n_{ij}$ is the frequency of the transitions $(i \to j)$ observed
in the time series (Bishop, Fienberg, & Holland, 1975). Briefly, the estimate is
found in this way:
1. First, we need to identify the sampling model, that is, the probability
distribution from which the observed data were generated. Typically, the sampling
model is known up to a vector of unknown parameters $\theta$, which we wish to
estimate from the data. In our problem, the sampling model is the transition
probability matrix $P$, which is a set of independent discrete distributions,
one for each row of $P$. The vector of parameters $\theta$ is given by the set of
conditional probabilities $(p_{ij})$.

2. The sample data $S$ and the sampling model allow us to write down the
likelihood function $p(S \mid \theta)$: the probability of the data given the sampling
model and $\theta$. Since the data are observed, $p(S \mid \theta)$ is only a function of $\theta$ and
we estimate $\theta$ by finding the value that maximizes the likelihood function.
This procedure returns the Maximum Likelihood estimate of $\theta$ and, hence,
the parameter value that makes the observed data most likely.



The assumption that the generating process is a MC implies that transitions from
state $i$ of the variable $X$ are independent of transitions from any of the other
states of the variable. Therefore, rows of the transition probability matrix $P$
are independent distributions. Data relevant to the $i$th row distribution are the
$n_i$ transitions $(i \to j)$, for any $j$, observed in the time series, and the probability
of observing the transitions $(i \to 1), \ldots, (i \to s)$ with frequencies $n_{i1}, \ldots, n_{is}$ is
$\prod_j p_{ij}^{n_{ij}}$. A discrete random variable with probability mass function proportional
to $\prod_j p_{ij}^{n_{ij}}$ has a multinomial distribution (Bishop et al., 1975). By independence,
the likelihood function is the product:

$$p(S \mid \theta) = \prod_{i=1}^{s} \prod_{j=1}^{s} p_{ij}^{n_{ij}} \qquad (1)$$

and depends on the data only via $n_{ij}$. Maximization of $p(S \mid \theta)$, with the constraint
that $\sum_j p_{ij} = 1$, for all $i$, returns the estimate $\hat{p}_{ij} = n_{ij}/n_i$.
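
The maximum likelihood estimate above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names and the toy series are hypothetical, states are assumed to be coded 1 to s, and rows with no observed transitions are returned as `None` (a choice made here, since $n_{ij}/n_i$ is undefined when $n_i = 0$).

```python
from collections import Counter

def transition_counts(series, s):
    """Return the s x s table n[i][j] of observed transitions i -> j (states 1..s)."""
    counts = Counter(zip(series[:-1], series[1:]))
    return [[counts[(i, j)] for j in range(1, s + 1)] for i in range(1, s + 1)]

def ml_estimate(n):
    """Maximum likelihood estimate n_ij / n_i; rows with n_i = 0 are returned as None."""
    p = []
    for row in n:
        n_i = sum(row)
        p.append([n_ij / n_i for n_ij in row] if n_i > 0 else None)
    return p

# Hypothetical toy series over s = 3 states.
series = [1, 2, 1, 2, 1, 3, 1, 2]
n = transition_counts(series, 3)
p_hat = ml_estimate(n)
```

Note that any transition never observed in the series receives probability exactly 0 under this estimate.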

This estimate uses only the observed data, while one may have some prior
information about a MC to take into account during the estimation of $\theta$. A
Bayesian approach provides a formal way to use both prior information and
data to estimate $\theta$. This is achieved by regarding $\theta$ as a random variable, whose
density $p(\theta)$ encodes prior knowledge. Data are used to update the prior density
of $\theta$ into the posterior density $p(\theta \mid S)$ by Bayes' Theorem:

$$p(\theta \mid S) = \frac{p(S \mid \theta)\, p(\theta)}{p(S)}$$

and the estimate of $\theta$ is the posterior expectation of $\theta$ (Ramoni & Sebastiani,
1999).

To choose the prior, we suppose we have some background knowledge that
can be represented in terms of a hypothetical time series of length $\alpha - s^2 + 1$
in which the $\alpha - s^2$ transitions are divided into $\alpha_{ij} - 1$ transitions of type
$(i \to j)$. This background knowledge gives rise to an $s \times s$ contingency table,
homologous to the frequency table, containing these hypothetical transitions
$\alpha_{ij} - 1$ that are used to formulate a conjugate prior¹ with density function
$p(\theta) \propto \prod_{i=1}^{s} \prod_{j=1}^{s} p_{ij}^{\alpha_{ij} - 1}$. This is the density function of $s$ independent Dirichlet
distributions, with hyper-parameters $\alpha_{ij}$. Each Dirichlet distribution is a prior for
the parameters $p_{i1}, \ldots, p_{is}$ associated with the $i$th row conditional distribution
of the matrix $P$. Standard notation denotes one Dirichlet distribution associated
with the conditional probabilities $(p_{i1}, \ldots, p_{is})$ by $D(\alpha_{i1}, \ldots, \alpha_{is})$. The distribution
given by independent Dirichlet distributions is called a Hyper-Dirichlet distribution
(Dawid & Lauritzen, 1993) and it is commonly used to model prior knowledge in Bayesian
networks (Ramoni & Sebastiani, 1999). We will denote such a Hyper-Dirichlet
distribution by $HD(\alpha_{ij})_s$, where the index $s$ denotes the number of independent
Dirichlet distributions defining the Hyper-Dirichlet.

By letting $\alpha_i$ denote $\sum_j \alpha_{ij}$, this prior distribution assigns probability $\alpha_{ij}/\alpha_i$
to the transition $(i \to j)$, with variance $(\alpha_{ij}/\alpha_i)(1 - \alpha_{ij}/\alpha_i)/(\alpha_i + 1)$. For fixed
$\alpha_{ij}/\alpha_i$, the variance is a decreasing function of $\alpha_i$ and, since small variance
implies a large precision about the estimate, $\alpha_i$ is called the local precision about
the conditional distribution $(p_{i1}, \ldots, p_{is})$ and indicates the level of confidence
about the prior specification. The quantity $\alpha = \sum_i \alpha_i$ is the global precision, as
it accounts for the level of precision of all the $s$ conditional distributions defining
the sampling model. We note that, when $\alpha_{ij}$ is constant, $\alpha_{ij}/\alpha_i = 1/s$ so that all
transitions are supposed to be equally likely. Priors with this hyper-parameter
specification are known as symmetric priors (Good, 1968).
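
As a quick numerical illustration of these formulas, consider a symmetric prior with $s = 5$ states. The value $\alpha = 50$ below is a hypothetical setting chosen only for contrast. With $\alpha = 5$, each $\alpha_{ij} = \alpha/s^2 = 1/5$ and each local precision is $\alpha_i = 1$; with $\alpha = 50$, $\alpha_{ij} = 2$ and $\alpha_i = 10$. In both cases the prior probability of every transition is $1/5$, but the prior variance shrinks as the precision grows:

$$
\begin{aligned}
\alpha = 5:&\quad \frac{(1/5)(1 - 1/5)}{1 + 1} = 0.08, \\
\alpha = 50:&\quad \frac{(1/5)(1 - 1/5)}{10 + 1} \approx 0.0145.
\end{aligned}
$$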

A Bayesian estimate of the probabilities $p_{ij}$ is the posterior expectation of
$p_{ij}$. By conjugate analysis (Ramoni & Sebastiani, 1999), the posterior distribution
of $\theta$ is still Hyper-Dirichlet with updated hyper-parameters $\alpha_{ij} + n_{ij}$ and
the posterior expectation of $p_{ij}$ is

$$\hat{p}_{ij} = \frac{\alpha_{ij} + n_{ij}}{\alpha_i + n_i}. \qquad (2)$$

Equation 2 can be rewritten as

$$\hat{p}_{ij} = \frac{\alpha_{ij}}{\alpha_i} \cdot \frac{\alpha_i}{\alpha_i + n_i} + \frac{n_{ij}}{n_i} \cdot \frac{n_i}{\alpha_i + n_i} \qquad (3)$$

which shows that $\hat{p}_{ij}$ is an average of the estimate $n_{ij}/n_i$ and of the quantity
$\alpha_{ij}/\alpha_i$, with weights depending on $\alpha_i$ and $n_i$. When $n_i \gg \alpha_i$, the estimate of $p_{ij}$
is approximately $n_{ij}/n_i$, and the effect of the prior is overcome by data. However,
when $n_i \ll \alpha_i$, the prior plays a role. In particular, the Bayesian estimate 2 is
never 0 when $n_{ij} = 0$.

¹ A prior distribution is said to be conjugate when it has the same functional form as
the likelihood function.

Example 1. Table 1 reports the frequencies of transitions $n_{ij}$, $i, j = 1, \ldots, 5$, observed
in the first time series in Figure 1 and the learned transition matrix when
the prior global precision is $\alpha = 5$ and $\alpha_{ij} = 1/5$. The matrix $\hat{P}$ describes a
dynamic process characterized by frequent transitions between states 1, 2 and
3 while states 4 and 5 are visited rarely. Note that although the observed
frequency table is sparse, as 14 transitions are never observed, null frequencies of
some transitions do not induce null probabilities. The small numbers of transitions
observed from state 3 ($n_3 = 7$), state 4 ($n_4 = 2$) and state 5 ($n_5 = 4$) do not
rule out, for instance, the possibility of transitions from 3 to either 2, 3 or 4. A
summary of the essential dynamics is in Figure 2, in which double headed paths
represent mutual transitions. Transitions with probability smaller than 0.05 are
not represented.

Table 1. Frequency and transition matrices for the first time series in Figure 1.

         Frequencies                   Transition probabilities
      1    2    3    4    5            1     2     3     4     5
 1    3   12    3    0    3       1   0.15  0.55  0.15  0.01  0.15
 2   11    1    2    2    0       2   0.66  0.07  0.13  0.13  0.01
 3    6    0    0    0    1       3   0.78  0.03  0.03  0.03  0.15
 4    0    0    2    0    0       4   0.07  0.07  0.73  0.07  0.07
 5    0    4    0    0    0       5   0.04  0.84  0.04  0.04  0.04
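
The transition probabilities in Table 1 can be recomputed from the frequencies with Equation 2. The following is a minimal sketch under the symmetric prior of Example 1 ($\alpha = 5$, so $\alpha_{ij} = \alpha/s^2 = 1/5$ and $\alpha_i = \alpha/s = 1$); the function name `bayes_estimate` is illustrative and not taken from the source.

```python
def bayes_estimate(n, alpha):
    """Posterior expectation (alpha_ij + n_ij) / (alpha_i + n_i) under a symmetric prior."""
    s = len(n)
    a_ij = alpha / (s * s)   # symmetric prior: every alpha_ij equals alpha / s^2
    a_i = alpha / s          # local precision alpha_i = sum_j alpha_ij
    return [[(a_ij + n_ij) / (a_i + sum(row)) for n_ij in row] for row in n]

# Frequencies n_ij taken from Table 1.
n = [[3, 12, 3, 0, 3],
     [11, 1, 2, 2, 0],
     [6, 0, 0, 0, 1],
     [0, 0, 2, 0, 0],
     [0, 4, 0, 0, 0]]

p_hat = bayes_estimate(n, alpha=5)
for row in p_hat:
    print("  ".join(f"{p:.2f}" for p in row))
```

Every entry of `p_hat` is strictly positive, including those whose observed frequency is 0, which is the property the example highlights.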




Fig. 2. Markov Chain induced from data. 



2.2 Clustering 

The second step of the learning process is an unsupervised agglomerative clustering
of time series on the basis of their dynamics. The available data is a set
$S = \{S_i\}$ of $m$ time series and the task of the clustering algorithm is two-fold:
finding the set of clusters that gives the best partition of the data and assigning
each time series $S_i$ to one cluster, without fixing the number of clusters in advance.

Formally, clustering is done by regarding a partition as a discrete variable $C$
with states $C_1, \ldots, C_c$ that are not observed. Each state $C_k$ of the variable $C$
labels, in the same way, time series generated by the same MC with transition
probability matrix $P_k$ and, hence, it represents a cluster of time series. The
number $c$ of states of the variable $C$ is unknown but it is bounded above by the
total number of time series in the data set $S$. The maximum number of clusters
is obtained when each time series is generated by a different MC. Thus, initially,
each time series has its own label. The clustering algorithm then tries to relabel
those time series that are likely to have been generated by the same MC and
thus merges the initial states $C_1, \ldots, C_m$ into a subset $C_1, \ldots, C_c$, with $c \le m$.
Figure 3 provides an example of three different relabelings of a data set $S$ of
four time series. Each relabeling determines a model so that, for example, model
$M_1$ is characterized by the variable $C$ having three states $C_1$, $C_2$ and $C_3$ with
$C_1$ labeling time series $S_1$ and $S_2$, $C_2$ labeling $S_3$ and $C_3$ labeling $S_4$. In models
$M_2$ and $M_3$, the variable $C$ has only two states but they correspond to different
labelings of the time series and hence different clusters.

The specification of the number $c$ of states of the variable $C$ and the assignment
of one of its states to each time series $S_i$ define a statistical model $M_c$.
Thus, we can regard the clustering task as a Bayesian model selection problem,
in which the model we are looking for is the most probable way of re-labeling
time series, given the data. We denote by $p(M_c)$ the prior probability of each
model $M_c$ and then we use Bayes' Theorem to compute its posterior probability,
and we select the model with maximum posterior probability. The posterior
probability of $M_c$, given the sample $S$, is $p(M_c \mid S) \propto p(M_c)\, p(S \mid M_c)$ where $p(S \mid M_c)$
is the marginal likelihood. This marginal likelihood differs from the likelihood
function because it is only a function of the model $M_c$, while the likelihood
function depends on parameters $\theta$ quantifying the model $M_c$. Therefore, the marginal
likelihood is computed by averaging out the parameters from the likelihood
function. We show next that, under reasonable assumptions on the sample space, the
adoption of a particular parameterization for the model $M_c$ and the specification
of a conjugate prior lead to a simple, closed-form expression for the marginal
likelihood $p(S \mid M_c)$.

Fig. 3. Three models corresponding to different re-labeling of four time series.

Given a model $M_c$, that is, a specification of the number of states of the
variable $C$ and of the labeling of the original time series (or, equivalently,
conditional on the specification of $c$ clusters of time series), we suppose that the
marginal distribution of the variable $C$ is multinomial, with cell probabilities
$p_k = p(C = C_k \mid \theta)$. We also suppose that the MCs generating the time series
assigned to different clusters $C_k$ are independent, given $C$, and that time
series generated by the same MC are independent. We denote by $P_k = (p_{kij})$ the
transition probability matrix of the MC generating time series in cluster $C_k$ and
denote the cell probabilities by $p_{kij} = p(X_t = j \mid X_{t-1} = i, C_k, \theta)$. Therefore, the
overall likelihood function is

$$p(S \mid \theta) = \prod_{k=1}^{c} p_k^{m_k} \prod_{i=1}^{s} \prod_{j=1}^{s} p_{kij}^{n_{kij}}$$

where $n_{kij}$ denotes the frequency of transitions $(i \to j)$ observed in
all time series assigned to cluster $C_k$, and $m_k$ is the number of time series that
are assigned to cluster $C_k$. Compared to Equation 1, the likelihood is a product
of terms corresponding to each cluster and each factor is weighted by the
cluster probability $p_k^{m_k}$. We now define a prior density for $\theta$ as a product of
$c \times s + 1$ Dirichlet densities. A Dirichlet is the prior distribution assigned to $(p_k)$,
say $D(\beta_1, \ldots, \beta_c)$. The other $c \times s$ densities correspond to $c$ independent
Hyper-Dirichlet distributions $HD(\alpha_{kij})_s$, each distribution $HD(\alpha_{kij})_s$ being assigned
to the parameters $p_{kij}$ of the MC generating the time series in cluster $C_k$. The
marginal likelihood is then given by

$$p(S \mid M_c) = \int p(S \mid \theta)\, p(\theta)\, d\theta$$

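and it is easy to show (by using the same integration techniques as in Cooper &
Herskovits, 1992) that

$$p(S \mid M_c) = \frac{\Gamma(\beta)}{\Gamma(\beta + m)} \prod_{k=1}^{c} \frac{\Gamma(\beta_k + m_k)}{\Gamma(\beta_k)} \prod_{i=1}^{s} \frac{\Gamma(\alpha_{ki})}{\Gamma(\alpha_{ki} + n_{ki})} \prod_{j=1}^{s} \frac{\Gamma(\alpha_{kij} + n_{kij})}{\Gamma(\alpha_{kij})}$$

where $\Gamma(\cdot)$ denotes the Gamma function, $\alpha_{ki} = \sum_j \alpha_{kij}$, $n_{ki} = \sum_j n_{kij}$ is the
number of transitions from state $i$ observed in cluster $C_k$, and $\beta = \sum_k \beta_k$.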
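
In practice the products of Gamma functions overflow quickly, so the marginal likelihood is usually evaluated on the log scale. The sketch below is one way to do this with `math.lgamma`; the function name and argument layout are assumptions made for illustration, and the hyper-parameters are passed in explicitly rather than derived from a global precision.

```python
from math import lgamma

def log_marginal_likelihood(m_k, n, alpha, beta):
    """Return log p(S | M_c) for one partition of the time series.

    m_k[k]          number of time series assigned to cluster k
    n[k][i][j]      pooled transition counts n_kij of cluster k
    alpha[k][i][j]  hyper-parameters alpha_kij
    beta[k]         hyper-parameters beta_k
    """
    m, b = sum(m_k), sum(beta)
    logp = lgamma(b) - lgamma(b + m)
    for k in range(len(m_k)):
        logp += lgamma(beta[k] + m_k[k]) - lgamma(beta[k])
        for a_row, n_row in zip(alpha[k], n[k]):
            # Row term: Gamma(alpha_ki) / Gamma(alpha_ki + n_ki)
            logp += lgamma(sum(a_row)) - lgamma(sum(a_row) + sum(n_row))
            # Cell terms: Gamma(alpha_kij + n_kij) / Gamma(alpha_kij)
            logp += sum(lgamma(a + c) - lgamma(a) for a, c in zip(a_row, n_row))
    return logp

# Toy check: one cluster, one series, s = 2, a single observed transition 1 -> 1,
# and alpha_kij = 1 (uniform Dirichlet rows), so the marginal likelihood is 1/2.
ones = [[1.0, 1.0], [1.0, 1.0]]
one_step = log_marginal_likelihood([1], [[[1, 0], [0, 0]]], [ones], [1.0])
```

Two candidate partitions of the same data can then be compared by subtracting their log marginal likelihoods.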

Once the a posteriori most likely partition has been selected, the transition
probability matrix $P_k$ associated with the cluster $C_k$ can be estimated as

$$\hat{p}_{kij} = \frac{\alpha_{kij} + n_{kij}}{\alpha_{ki} + n_{ki}}$$

and the probability of $C = C_k$ can be estimated as

$$\hat{p}_k = \frac{\beta_k + m_k}{\beta + m}.$$

We conclude this section by suggesting a choice of the hyper-parameters $\alpha_{kij}$
and $\beta_k$. We use symmetric prior distributions for all the transition probabilities
considered at the beginning of the search process. The initial $m \times s \times s$
hyper-parameters $\alpha_{kij}$ are set equal to $\alpha/(ms^2)$ and, when two time series are assigned
to the same cluster and the corresponding observed frequencies of transitions are
summed up, their hyper-parameters are summed up. Thus, the hyper-parameters
of a cluster corresponding to the merging of $m_k$ time series will be $m_k \alpha/(ms^2)$.
An alternative solution is to distribute the initial precision $\alpha$ uniformly across
clusters, so that the hyper-parameters of a model with $c$ clusters are $\alpha/(cs^2)$. In
both ways, the specification of the prior hyper-parameters requires only the prior
global precision $\alpha$, which measures the confidence in the prior model, and the
marginal likelihoods of different models are functions of the same $\alpha$. An analogous
procedure can be applied to the hyper-parameters $\beta_k$ associated with the prior
estimates of $p_k$. Empirical evaluations have shown that the magnitude of the
$\alpha$ value has the effect of zooming out differences between dynamics of different
time series, so that increasing the value of $\alpha$ produces an increasing number of
clusters.

2.3 A Heuristic Search 

To implement the clustering method described in the previous section, we should
search all possible partitions and return the one with maximum posterior
probability. Unfortunately, the number of possible partitions grows exponentially
with the number of time series and a heuristic method is required to make the
search feasible.

A good heuristic search could be to merge, or agglomerate, first the pairs of time
series producing similar transition probability tables. What makes two tables
similar? Recall that each row of a transition probability table corresponds to a
probability distribution over states at time $t$ given a state at time $t-1$. Let $P_1$
and $P_2$ be tables of transition probabilities of two MCs. Because each table is
a collection of $s$ row conditional probability distributions, rows with the same
index are probability distributions conditional on the same event. The measure of
similarity that BCD uses is therefore an average of the Kullback-Leibler distances
between row conditional distributions. Let $p_{1ij}$ and $p_{2ij}$ be the probabilities of
the transition $X_t = j \mid X_{t-1} = i$ in $P_1$ and $P_2$. The Kullback-Leibler distance of the
two probability distributions in row $i$ is $D(p_{1i}, p_{2i}) = \sum_{j=1}^{s} p_{1ij} \log(p_{1ij}/p_{2ij})$.
The average distance between $P_1$ and $P_2$ is then $D(P_1, P_2) = \sum_i D(p_{1i}, p_{2i})/s$.
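
The average Kullback-Leibler distance can be sketched as follows. The function names are illustrative. The sketch assumes that every entry of the second table is strictly positive, which holds for the Bayesian estimates of Equation 2, and it skips terms with a zero probability in the first table, following the usual convention $0 \log 0 = 0$.

```python
from math import log

def kl_row(p, q):
    """Kullback-Leibler distance between two row distributions p and q."""
    return sum(pj * log(pj / qj) for pj, qj in zip(p, q) if pj > 0)

def average_kl(P1, P2):
    """Average of the row-wise Kullback-Leibler distances between two tables."""
    return sum(kl_row(p, q) for p, q in zip(P1, P2)) / len(P1)

# Hypothetical 2-state tables: the first rows differ, the second rows agree.
P1 = [[0.5, 0.5], [0.5, 0.5]]
P2 = [[0.25, 0.75], [0.5, 0.5]]
d12 = average_kl(P1, P2)
d21 = average_kl(P2, P1)
```

The measure is not symmetric (`d12` and `d21` differ here), which matters little in BCD because the distance only orders the candidate merges.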

Iteratively, BCD computes the set of pairwise distances between the transition
probability tables, sorts the generated distances, merges the two time series with
the closest transition probability tables and evaluates the result. The evaluation asks
whether the resulting model $M_c$, in which two time series are assigned to the
same cluster, is more probable than the model $M_s$ in which these time series are
generated by different MCs, given the data $S$. If the probability $p(M_c \mid S)$ is higher
than $p(M_s \mid S)$, BCD updates the set of transition probability tables by replacing
the two merged tables with the table resulting from their merging. Then, BCD updates
the set of ordered distances by removing all the ordered pairs involving the merged MCs,
and by adding the distances between the new MC and the remaining MCs in the
set. The procedure repeats on the new set of MCs. If the probability $p(M_c \mid S)$
is not higher than $p(M_s \mid S)$, BCD tries to merge the second best pair, the third best,
and so on, until the set of pairs is empty and, in this case, returns the most
probable partition found so far. The rationale of this search is that merging similar
MCs first should result in better models and increase the posterior probability
sooner, thus improving the performance of the hill-climbing algorithm. Empirical
evaluations in controlled experiments appear to support this intuition (Ramoni
et al., 1999). Note that the similarity measure is just used as a heuristic guide
for the search process rather than a grouping criterion.
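
The search loop can be sketched compactly. This is a simplified illustration under stated assumptions, not the authors' code: all names are hypothetical; the prior $p(M_c)$ is taken to be uniform, so models are compared through their marginal likelihoods alone; hyper-parameters follow the $m_k \alpha/(ms^2)$ rule, with $\beta_k = m_k \beta / m$ assumed by analogy; and all distances are recomputed after each merge instead of being updated incrementally.

```python
from itertools import combinations
from math import lgamma, log

def log_ml(clusters, alpha, beta, m, s):
    """Log marginal likelihood of a partition; each cluster is (m_k, counts)."""
    logp = lgamma(beta) - lgamma(beta + m)
    for m_k, n in clusters:
        b_k = beta * m_k / m             # assumed: beta shared out like alpha
        a = alpha * m_k / (m * s * s)    # alpha_kij = m_k * alpha / (m s^2)
        logp += lgamma(b_k + m_k) - lgamma(b_k)
        for row in n:
            logp += lgamma(s * a) - lgamma(s * a + sum(row))
            logp += sum(lgamma(a + c) - lgamma(a) for c in row)
    return logp

def table(cluster, alpha, m, s):
    """Bayesian estimate of a cluster's transition probability table."""
    m_k, n = cluster
    a = alpha * m_k / (m * s * s)
    return [[(a + c) / (s * a + sum(row)) for c in row] for row in n]

def avg_kl(P1, P2):
    """Average row-wise Kullback-Leibler distance (all entries are positive)."""
    return sum(p * log(p / q) for r1, r2 in zip(P1, P2)
               for p, q in zip(r1, r2)) / len(P1)

def bcd_search(count_tables, alpha=1.0, beta=1.0):
    """Greedy agglomeration; returns a list of (m_k, pooled counts) clusters."""
    m, s = len(count_tables), len(count_tables[0])
    clusters = [(1, n) for n in count_tables]   # each series starts alone
    merged = True
    while merged and len(clusters) > 1:
        merged = False
        current = log_ml(clusters, alpha, beta, m, s)
        tabs = [table(c, alpha, m, s) for c in clusters]
        pairs = sorted(combinations(range(len(clusters)), 2),
                       key=lambda ab: avg_kl(tabs[ab[0]], tabs[ab[1]]))
        for a, b in pairs:                      # closest pair first
            (ma, na), (mb, nb) = clusters[a], clusters[b]
            pooled = [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(na, nb)]
            trial = [c for i, c in enumerate(clusters) if i not in (a, b)]
            trial.append((ma + mb, pooled))
            if log_ml(trial, alpha, beta, m, s) > current:
                clusters, merged = trial, True
                break                           # restart on the new set of MCs
    return clusters
```

A call such as `bcd_search([n_1, n_2, n_3])`, where each argument entry is an $s \times s$ table of transition counts for one time series, returns the clusters as pairs of a size and a pooled count table.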

3 Reasoning and Prediction with Clusters of Dynamics 

The BCD algorithm partitions a batch $S$ of $m$ time series into $c$ clusters. Each
cluster $C_k$ groups time series generated by the MC with transition probability
matrix $P_k$, which is estimated as $\hat{p}_{kij} = (\alpha_{kij} + n_{kij})/(\alpha_{ki} + n_{ki})$. The grouping of
time series into clusters provides estimates of the marginal probability that a
future time series is generated from the MC with transition probability matrix $P_k$. This
probability is the quantity $\hat{p}_k = (\beta_k + m_k)/(\beta + m)$. The probabilities $\hat{p}_k$ and $\hat{p}_{kij}$
can be used to recognize the cluster from which a new time series is generated
by using Bayesian predictive-sequential inference (Cowell, Dawid, Lauritzen, &
Spiegelhalter, 1999).

Suppose we observe a transition $(x_0, x_1)$, with $x_0$ chosen, and we know that
this transition can be generated only from one of the $c$ MCs estimated from the
$c$ clusters, and we wish to decide which of the $c$ MCs is more likely to be the
generating model. Conditional on the transition $(x_0, x_1)$, we can compute the
probability that each of the $c$ MCs is the generating process by using Bayes'
Theorem:

$$p(C_k \mid x_0, x_1) = \frac{p(x_0, x_1 \mid C_k)\, p(C_k)}{p(x_0, x_1)}.$$

The quantity $p(C_k \mid x_0, x_1)$ is the probability that the generating process is the MC
in cluster $C_k$, given the transition $(x_0, x_1)$, while $p(x_0, x_1 \mid C_k)$ is the probability
of observing the transition $(x_0, x_1)$, given that the generating process is the MC
in cluster $C_k$. Bayes' Theorem lets us update the prior probability $p(C_k)$ into the
posterior probability $p(C_k \mid x_0, x_1)$ via the updating ratio $p(x_0, x_1 \mid C_k)/p(x_0, x_1)$,
and the posterior probability distribution over the clusters can be used to choose
the MC which, most likely, generated the transition $(x_0, x_1)$.

We only need to compute the updating ratio and, hence, the conditional
probability $p(x_0, x_1 \mid C_k)$, for any $C_k$, and the marginal probability $p(x_0, x_1)$. This
marginal probability is computed as

$$p(x_0, x_1) = \sum_k p(x_0, x_1 \mid C_k)\, p(C_k)$$

so that the crucial quantity to compute remains the conditional probability
$p(x_0, x_1 \mid C_k)$. If the value $x_0$ is chosen deterministically, the probability
$p(x_0, x_1 \mid C_k)$ is simply the transition probability $p(x_1 \mid C_k, x_0)$, which, if
$x_0 = i$ and $x_1 = j$, is $p_{kij}$, and the posterior probability of cluster $C_k$ is

$$p(C_k \mid x_0, x_1) = \frac{p(x_1 \mid C_k, x_0)\, p(C_k)}{\sum_k p(x_1 \mid C_k, x_0)\, p(C_k)}. \qquad (4)$$

The updating becomes slightly more complex when more than one transition is
observed. Suppose, for example, we observe the sequence $S = (x_0, x_1, x_2)$ and,
hence, the pair of transitions $(x_0, x_1)$ and $(x_1, x_2)$. The posterior distribution
over the clusters, conditional on $S$, is

$$p(C_k \mid x_0, x_1, x_2) = \frac{p(x_0, x_1, x_2 \mid C_k)\, p(C_k)}{p(x_0, x_1, x_2)}$$

and the updating ratio is now $p(x_0, x_1, x_2 \mid C_k)/p(x_0, x_1, x_2)$. As before, the
marginal probability $p(x_0, x_1, x_2)$ is $\sum_k p(x_0, x_1, x_2 \mid C_k)\, p(C_k)$ and the quantity to
compute is $p(x_0, x_1, x_2 \mid C_k)$. This probability can be factorized as

$$p(x_0, x_1, x_2 \mid C_k) = p(x_0, x_1 \mid C_k)\, p(x_2 \mid C_k, x_0, x_1) = p(x_1 \mid C_k, x_0)\, p(x_2 \mid C_k, x_1). \qquad (5)$$

The first simplification $p(x_0, x_1 \mid C_k) = p(x_1 \mid C_k, x_0)$ is the same used above. The
second simplification $p(x_2 \mid C_k, x_0, x_1) = p(x_2 \mid C_k, x_1)$ is a consequence of the
Markov assumption. Both probabilities are then given by the estimates $\hat{p}_{kij}$ and
$\hat{p}_{kjl}$ if $x_0 = i$, $x_1 = j$, and $x_2 = l$.

Simplification of Equation 5 determines this expression for the marginal
probability $p(x_0, x_1, x_2)$:

$$p(x_0, x_1, x_2) = \sum_k p(x_0, x_1, x_2 \mid C_k)\, p(C_k) = \sum_k [p(x_1 \mid C_k, x_0)\, p(C_k)]\, p(x_2 \mid C_k, x_1)$$

so that the posterior probability $p(C_k \mid x_0, x_1, x_2)$ is given by

$$p(C_k \mid x_0, x_1, x_2) = \frac{[p(x_1 \mid C_k, x_0)\, p(C_k)]\, p(x_2 \mid C_k, x_1)}{\sum_k [p(x_1 \mid C_k, x_0)\, p(C_k)]\, p(x_2 \mid C_k, x_1)}.$$

Now, from Equation 4, we have that $p(x_1 \mid C_k, x_0)\, p(C_k)$ is proportional to
$p(C_k \mid x_0, x_1)$, with a proportionality constant independent of $k$. Therefore, we
can rewrite $p(C_k \mid x_0, x_1, x_2)$ above as

$$p(C_k \mid x_0, x_1, x_2) = \frac{p(x_2 \mid C_k, x_1)\, p(C_k \mid x_0, x_1)}{\sum_k p(x_2 \mid C_k, x_1)\, p(C_k \mid x_0, x_1)}$$

which is identical to formula 4 with $p(C_k \mid x_0, x_1)$ playing the role of $p(C_k)$. This
property is the core of the predictive-sequential approach: for any sequence
$S = (x_0, x_1, x_2, \ldots)$, the posterior distribution over the clusters can be computed
sequentially, by using each transition $(x_{t-1}, x_t)$ in turn to update the current
prior distribution into a posterior distribution using formula (4). The posterior
distribution will become the prior for the next updating. This result is used in
the next section.
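
The sequential update can be sketched as a short loop over the transitions of the observed sequence. The function name `classify` and the example tables are hypothetical; states are assumed to be coded 1 to s, and the initial state $x_0$ is treated as given, as in the derivation above.

```python
def classify(sequence, tables, prior):
    """Posterior over clusters after processing every transition in sequence.

    tables[k][i-1][j-1] is the estimated probability of i -> j in cluster k,
    and prior[k] is the estimated cluster probability.
    """
    post = list(prior)
    for prev, nxt in zip(sequence[:-1], sequence[1:]):
        # Formula (4), with the current posterior playing the role of the prior.
        weights = [P[prev - 1][nxt - 1] * w for P, w in zip(tables, post)]
        total = sum(weights)
        post = [w / total for w in weights]
    return post

# Two hypothetical clusters over s = 2 states: one tends to move to state 1,
# the other tends to move to state 2.
T1 = [[0.9, 0.1], [0.9, 0.1]]
T2 = [[0.1, 0.9], [0.1, 0.9]]
posterior = classify([1, 1, 1], [T1, T2], prior=[0.5, 0.5])
```

Because each step only needs the previous posterior, the classification can be refreshed online as new observations arrive.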



4 Bayesian Clustering of Sensory Inputs 

This section begins with a description of the robot used in the experiments. It 
then proceeds by analyzing the application of BCD to the unsupervised generation 
of a representation of the robot’s experiences, and finally shows how to use this 
abstract representation to enable the robot to classify its current situation and 
predict its evolution. 




Fig. 4. The Pioneer 1 robot. 



4.1 The Robot and Its Sensors 

Our robot is the Pioneer 1 depicted in Figure 4. It is a small platform with two
drive wheels and a trailing caster, and a two degree of freedom paddle gripper
(the two metal arms coming out of the platform). For sensors, the Pioneer 1
has shaft encoders, stall sensors, five forward pointing and two side pointing
sonars, bump sensors, and a pair of infra-red sensors at the front and back of
its gripper. The bump sensors signal when the robot has touched an object so,
for example, they go on when the robot pushes or bumps into something. The
infra-red sensors at the front and back of the gripper signal when an object
is within the grippers. The robot also has a simple vision system that reports
the location and size of colored objects. Our configuration of the Pioneer 1 has
roughly forty streams of sensor data, though the values returned by some are
derived from others.



Table 2. Robot's sensors and their range of values.

Sensor name   Interpretation                                      Range of values
r.vel         velocity of right wheel                             -600 to 600
l.vel         velocity of left wheel                              -600 to 600
grip.f        infra-red sensor at the front of the gripper        0 = off; 1 = on
grip.r        infra-red sensor at the rear of the gripper         0 = off; 1 = on
grip.b        bumper sensor                                       0 = off; 1 = on
vis.a         number of pixels of object in the visual field      0 to 40,000
vis.x         horizontal location of object in the visual field   -140 = nearest; 140 = furthest
vis.y         position of object in the visual field              0 = most left; 256 = most right



We will focus attention on 8 sensors, described in Table 2. Figure 5 shows
an example of sensor values recorded during 30 seconds of activity of the robot.
The velocity-related sensors r.vel and l.vel, as well as the sensors of the vision system
vis.a, vis.x and vis.y, take continuous values and were discretized into 5 bins of
equal length, labeled from 0.2 to 1. The vis.a sensor has a highly skewed
distribution, so the square root of the original values was discretized. Hence,
the category 0.2 for both sensors r.vel and l.vel represents values between -600 and -360,
while 0.4 represents values between -360 and -120, 0.6 represents values between
-120 and 120 and so on. Negative values of the velocities of both wheels represent
the robot moving backward, while positive velocities of both wheels represent
forward movements. A negative velocity on one wheel and a positive velocity on
the other result in the robot turning. The first two plots show sensory values of the
left and right wheel velocity from which we can deduce that the robot is probably
not moving during the first 5 seconds (steps 1 to 50) or is moving slowly (the bin
labeled 0.6 represents the range of velocity between -120 and 120). After the 5th second, the
robot turns (the values of the velocity of the two wheels are discordant), stops,
and then moves forward, first at low velocity then at increasing velocity until
it stops, begins moving backward (as the velocity of both wheels is negative) and
then forward again. Note that the sensors grip.f and grip.b go on and stay on
in the same time interval. Furthermore, the dynamics of the sensor vis.a shows
the presence of an object of increasing and decreasing size in the visual field,
with maximum size corresponding to the time in which the sensors of the wheel
velocity record a change of trend. The trend of the other two sensors of the vision
system, vis.x and vis.y, both support the idea that the robot is moving toward
an object, bumps into it, and then moves away.
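
The equal-width discretization can be sketched as below. The function name is illustrative, and two details are assumptions because the text does not specify them: a value lying exactly on a bin edge is assigned to the upper bin, and values outside the sensor range are clamped to the first or last bin.

```python
from math import sqrt

def discretize(value, low, high, bins=5):
    """Map a value in [low, high] to one of the labels 0.2, 0.4, ..., 1.0."""
    width = (high - low) / bins
    index = int((value - low) // width)
    index = max(0, min(index, bins - 1))   # clamp; the top edge joins the last bin
    return round((index + 1) / bins, 1)

# Wheel velocities range from -600 to 600, so each bin is 240 units wide.
slow = discretize(0, -600, 600)            # falls in the bin from -120 to 120
backward = discretize(-400, -600, 600)     # falls in the bin from -600 to -360

# vis.a is discretized on the square-root scale: sqrt(40000) = 200, bins of width 40.
small_object = discretize(sqrt(5000), 0, sqrt(40000))
```

On the square-root scale, the bin edges 40, 80 and 120 correspond to 1600, 6400 and 14,400 pixels.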




Fig. 5. Sensory inputs recorded by the robot during 30 seconds. The x-axes report the
time, measured every 1/10 of a second.



The robot is programmed to engage in several different activities (moving
toward an object, losing sight of an object, bumping into something), and all
these activities will have different sensory signatures. If we regard the sequence
of sensory inputs of the robot as a time series, different sensory signatures can be
identified with different dynamic processes generating the series. It is important
to the goals of our project that the robot's learning should be unsupervised,
which means we do not tell the robot when it has switched from one activity
to another. Instead, we define a simple event marker (a simultaneous change
of at least three sensors) and we define an episode as the time series between
two consecutive event markers. The available data is then a set of episodes for
each sensor and the statistical problem is to cluster episodes having the same
dynamics. The next section will apply the BCD algorithm to solve this problem.
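
One possible reading of the event marker is sketched below. It is an interpretation, not the authors' procedure: a "simultaneous change" is taken to mean that at least three sensors differ between two consecutive time steps, and the time step at which the change occurs is taken to start a new episode. The function name and the toy readings are hypothetical.

```python
def split_episodes(readings, min_changes=3):
    """Split a multi-sensor recording into episodes.

    readings is a list with one tuple of discretized sensor values per time step.
    Returns (start, end) index pairs, with end exclusive.
    """
    boundaries = [0]
    for t in range(1, len(readings)):
        changed = sum(a != b for a, b in zip(readings[t - 1], readings[t]))
        if changed >= min_changes:
            boundaries.append(t)           # event marker: a new episode starts here
    boundaries.append(len(readings))
    return list(zip(boundaries[:-1], boundaries[1:]))

# Toy recording of 4 sensors: three sensors change together at step 3,
# while a single sensor changes at step 5.
readings = [(0, 0, 0, 0)] * 3 + [(1, 1, 1, 0)] * 2 + [(1, 1, 1, 1)]
episodes = split_episodes(readings)
```

Each episode can then be cut out of every sensor's series, for example `[r[0] for r in readings[start:end]]` for the first sensor, and passed to the clustering step.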
4.2 Clustering the Robot Sensory Inputs

In this section we describe the results obtained with the BCD algorithm on a 
data set of 11,118 values recorded, for each of the 8 sensors in Table 2, during 
an experimental trial that lasted about 30 minutes. The event marker led us to 
split the original time series into 36 episodes, of average length 316 time steps. 
The shortest episode was 6 time steps long and the longest episode was 2917 
time steps. 



Table 3. Size of clusters found by the BCD algorithm for the sensor vis.a. Braces
around a pair of numbers indicate that the two clusters were produced by splitting the
episodes belonging to one cluster obtained for a smaller value of α.

Sensor   α     Cluster sizes
vis.a    1     34, 2
         5     34, 2
         10    {18, 16}, 2
         20    18, {13, 3}, 2
         40    {11, 7}, {10, 3}, 3, 2



We ran our implementation of the BCD algorithm on the set of 36 episodes
for each sensor, using different values for the precision $\alpha$, while $\beta$ was set equal
to 1. Table 3 shows the number of clusters created by the BCD algorithm for the
sensor vis.a, for some of the values of $\alpha$ used in this experiment. A small value
of the precision $\alpha$ leads BCD to identify two clusters, one merging 34 episodes,
the other merging two episodes. Increasing values of $\alpha$ make the BCD algorithm
create an increasing number of clusters by monotonically splitting the single cluster
of 34 episodes. For example, this cluster is split into two clusters of 18 and 16
episodes when $\alpha = 10$. The cluster collecting 16 episodes is then split into two of
13 and 3 episodes, for $\alpha = 20$, and then, when $\alpha = 40$, the cluster of 18 episodes
is split into two clusters of 11 and 7 episodes, while the cluster of 13 episodes is
split into two clusters of 10 and 3 episodes. Larger values of $\alpha$ make BCD create
an even larger number of clusters, some of which represent MCs learned from very
sparse frequency matrices, so we decided to stop the algorithm at $\alpha = 40$, in
order to avoid overfitting.

The pictures in Table 4 represent the essential dynamics of the MCs induced
by the BCD algorithm from the 36 episodes of the sensor vis.a. Transitions
with probability below 0.01 are not represented in the chains, nor
are the uniform transition probabilities of visiting all states. We stress here that
the interpretation of the dynamics represented by the MCs is our own (we
looked at the transition probability matrices and labeled the dynamics according
to our knowledge of the robot's perception system) and it is by no means
knowledge acquired by the robot. So, for example, the first chain represents a
dynamic process concentrated on the first three states of the sensor vis.a. State
0.2 represents the presence, in the robot's visual field, of an object of size varying
between 0 and 1600 pixels, state 0.4 represents the presence of an object of size
between 1600 and 6400 pixels, while state 0.6 represents the presence of an object
of size between 6400 and 14,400 pixels. The maximum size is 40,000
pixels so that values between 0 and 14,400 represent an object that, at most,
takes 1/4 of the visual field. The dynamics among these three states is such that
the sensor value either stays constant or decreases (it visits a preceding state),
so that the overall dynamics is that of an object of decreasing
size in the visual field that eventually disappears. The interpretation of the other
dynamics was deduced in a similar way. Interestingly, the last chain represents
essentially a deterministic process, in which the sensor value is constant in state
0.2, showing that there is no object in the robot's visual field.

Table 4. MCs learned with the BCD algorithm from the 36 episodes of the sensor vis.a.

Cluster 5: Object approaching.
Cluster 6: No object in sight.

Similar results were found for the other sensors related to the robot's vision
system. For example, the values of the sensors vis.x and vis.y produce two clusters
for $\alpha = 1$ and eight clusters for $\alpha = 40$. The values of the other sensors tend to
produce a smaller number of clusters and to need a much higher prior precision to
induce clusters: the minimum value of the precision to obtain at least 2 clusters
was 20 for the sensor l.vel; 40 for the sensors r.vel, grip.f and grip.r; and it was
45 for the sensor grip.b. We used the same approach described above to choose
a value of $\alpha$: we ran the BCD algorithm for increasing values of $\alpha$ and stopped
the algorithm when it began producing MCs with very sparse frequency tables.

We found three clusters of MCs for both the sensors l.vel and r.vel ($\alpha = 45$
and 50), representing dynamics concentrated on null or negative values of the
velocity, null or positive values of the velocity and a mixture of those. We found
two clusters for the dynamics of sensor grip.f ($\alpha = 40$), the first one representing
a process in which the gripper front beam stays off with high probability and
with small probability goes on and stays on, while, in the second one, there is
a larger probability of changing from the off to the on state. Hence, the second
cluster represents more frequent encounters with an object. The episodes for
the sensor grip.r were partitioned into three clusters ($\alpha = 40$), one representing
rapid changes from the on to the off state, followed by a large probability of
staying off; one representing rare changes from the off to the on state, or the
other way round, followed by a large probability of staying in that state; the last
one representing the sensor in the on state. The episodes for the sensor grip.b
were partitioned into two clusters, one representing rare changes from the off to
the on state, or the other way round, followed by a large probability of staying in
that state; the last one representing the sensor in the on state. So, for example,
the first cluster represents the sensor dynamics when the robot is not near an
object but, when it is, it pushes it for some time. The second cluster is the sensor
dynamics when the robot is pushing an object. Finally, the three clusters found
for the sensor vis.x ($\alpha = 5$) discriminate the sensor dynamics when an object is
far from the robot, or near the robot or a mixture of the two. The four clusters
of dynamics for the sensor vis.y ($\alpha = 1$) distinguish among an object moving
from the left of the robot’s visual field to the center, from the left to the right,
from the right to the left, or in front of the robot.

The interpretation of the dynamic processes, represented by the clusters that
the BCD algorithm found for each sensor, is our own and not the robot’s. An
interesting and still open question is what the robot learned from these clusters.
The clusters found by the BCD algorithm assign a label to each episode so that, 
after this initial cluster analysis, the robot can replace each episode with a label 
representing a combination of 8 sensor clusters. Now, episodes labeled with the
same combination of sensor clusters represent the same “activity” characterized
by the same sensor dynamic signature. For example, one such activity is
characterized by the combination cluster 1 for r.vel, cluster 3 for l.vel, cluster 1 for
grip.f, grip.r and grip.b, cluster 2 for vis.a and vis.x and cluster 1 for vis.y. This
activity is repeated in 7 of the 36 episodes. Using our interpretation of the
dynamics represented by the clusters, we can deduce that this activity represents
the robot rotating and moving away from an object (the velocities of the wheels
are discordant, and the size of the object in the visual field decreases and
becomes null) and hence we have a confirmation that this activity is meaningful.
However, as far as the robot’s world is concerned, this activity is nothing more
than a combination of sensory dynamics.

This process of labeling the episodes as activities by replacing each sensor
episode by the cluster membership reduces the initial 36 episodes to 22 different
activities, some of which are experienced more than once. Thus, the robot learned
22 different activities characterized by different dynamic signatures.



4.3 Reasoning with Clusters of Dynamics 

Results of Section 3 can be used to provide the robot with tools to recognize the 
cluster it is in, given sensor data. We consider here the six clusters of dynamics 
learned for the vis.a sensor. From the data in Table 3, one can see that the
probability distribution over the 6 clusters in Figure 4 is 



Ck     C1     C2     C3     C4     C5     C6
pk     0.19   0.30   0.06   0.09   0.09   0.27



This probability distribution is estimated using $\hat{p}_k = (\beta_k + m_k)/(\beta + m)$ with
$\beta = 1$ and $\beta_k = 1/6$. Each cluster, in turn, represents a MC with transition
probabilities learned from the time series merged in the cluster. For example, the
transition probability table of the MC in cluster $C_1$ is





         0.2    0.4    0.6    0.8    1
0.2      1.00   0.00   0.00   0.00   0.00
0.4      0.52   0.12   0.12   0.12   0.12
0.6      0.02   0.10   0.82   0.02   0.02
0.8      0.20   0.20   0.20   0.20   0.20
1        0.20   0.20   0.20   0.20   0.20

Suppose now we observe the sequence S = (0.2, 0.2, 0.4, 0.6). The posterior
distribution over the clusters $C_k$, conditional on the first transition (0.2, 0.2), is
computed using formula 4. Since $p_{k11}$ is 1.00 for k = 1, 2, 6, 0.20 for k = 3, and 0.43
for k = 4, 5, it follows that p(0.2, 0.2) = 0.83 and $p(0.2, 0.2 | C_1)p(C_1) = 0.194$;
$p(0.2, 0.2 | C_2)p(C_2) = 0.301$; $p(0.2, 0.2 | C_3)p(C_3) = 0.012$; $p(0.2, 0.2 | C_4)p(C_4) = 0.036$;
$p(0.2, 0.2 | C_5)p(C_5) = 0.017$; and $p(0.2, 0.2 | C_6)p(C_6) = 0.271$. Formula 4
gives the posterior distribution over the clusters



Ck                   C1     C2     C3     C4     C5     C6
p(Ck | 0.2, 0.2)     0.24   0.36   0.01   0.04   0.02   0.33



This distribution, compared to the one learned from the data, assigns more weight
to clusters $C_1$, $C_2$ and $C_6$, and the most likely cluster generating transition
(0.2, 0.2) appears to be $C_2$. When we use this updated distribution over the
clusters as prior for the next updating, conditional on transition (0.2, 0.4), the
distribution turns out to be



Ck                        C1     C2     C3     C4     C5     C6
p(Ck | 0.2, 0.2, 0.4)     0.00   0.04   0.10   0.67   0.15   0.04



and the most likely cluster is $C_4$. Indeed, from Figure 4, we see that cluster $C_4$
assigns high probability to the transition (0.2, 0.4). The next updating produces



Ck                             C1     C2     C3     C4     C5     C6
p(Ck | 0.2, 0.2, 0.4, 0.6)     0.00   0.00   0.61   0.08   0.24   0.06



so that likely generating clusters are $C_3$ and $C_5$. Note that cluster $C_3$ is the most
likely and indeed it is the only cluster that gives high probability to the sequence
(0.2, 0.2, 0.4, 0.6).

This procedure can be used to reason with longer sequences so that the robot 
can recognize the cluster it is in and use this to make decisions. 
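
The updating step used above can be sketched in a few lines of Python. This is only an illustration, assuming that formula 4 is the usual Bayes rule (posterior proportional to prior times the probability of the observed transition in each cluster); the function name `update_cluster_posterior` is hypothetical. The inputs are the rounded values quoted in the text, so the output need not match the tabulated posterior digit for digit.

```python
def update_cluster_posterior(prior, likelihoods):
    """One Bayes update: posterior_k is proportional to prior_k * p(transition | C_k)."""
    joint = [p * l for p, l in zip(prior, likelihoods)]
    evidence = sum(joint)  # marginal probability of the observed transition
    if evidence == 0.0:
        raise ValueError("transition has zero probability under every cluster")
    return [j / evidence for j in joint]


# Cluster probabilities for C1..C6 as quoted in the text (rounded to two decimals).
prior = [0.19, 0.30, 0.06, 0.09, 0.09, 0.27]
# Probability of the transition (0.2, 0.2) in each cluster, as stated in the text.
likelihood = [1.00, 1.00, 0.20, 0.43, 0.43, 1.00]

posterior = update_cluster_posterior(prior, likelihood)
# Index of the most probable cluster, converted to the 1-based labels C1..C6.
most_likely = max(range(len(posterior)), key=lambda k: posterior[k]) + 1
print([round(p, 3) for p in posterior], "most likely: C%d" % most_likely)
```

To process a longer sequence, the returned posterior would be passed back in as the prior for the next observed transition, together with that transition's probability in each cluster.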

5 Future and Related Work 

Although the description of BCD in this paper assumes a batch of univariate
time series for reasons of simplicity, we have extended it to the multivariate
case. Suppose one has a multivariate time series of k variables, each of which
takes, say, v values. A first solution is trivial and consists of recoding these
series into a single one, as if it were generated by a variable taking $v^k$ values, one
for each combination of values of the original variables. The difficulty here is
computational: the transition probability table for the encoding variable will
have $(v^k)^2$ probabilities to estimate. A more sophisticated approach relies on
some assumptions of conditional independence among the variable dynamics so
that each cluster becomes a set of MCs with independent transition probability
tables (Ramoni, Sebastiani, & Cohen, 2000).
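
The trivial recoding can be illustrated with a short Python sketch. It is not the authors' implementation: the function name `recode` and the toy on/off series are hypothetical, and the code simply treats each joint observation as a number written in base v.

```python
def recode(series_by_variable, values):
    """Map k parallel series over `values` to one series of codes in range(v**k)."""
    v = len(values)
    index = {val: i for i, val in enumerate(values)}
    coded = []
    for joint in zip(*series_by_variable):      # one tuple of k values per time step
        code = 0
        for val in joint:
            code = code * v + index[val]        # base-v positional encoding
        coded.append(code)
    return coded


values = ["off", "on"]                          # v = 2 (hypothetical sensor states)
series = [["off", "on", "on"],                  # variable 1
          ["off", "off", "on"]]                 # variable 2, so k = 2
coded = recode(series, values)

n_states = len(values) ** len(series)           # v**k states of the encoding variable
n_transition_probs = n_states ** 2              # size of its transition table
print(coded, n_states, n_transition_probs)
```

Even in this toy case the transition table grows from 4 entries per variable to 16 for the encoding variable, which is the computational difficulty the text points out.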

Fig. 6. Graphical representation of the model used by BCD.

BCD models time series as first order MCs. More complex models involve the
use of k-order Markov chains (Saul & Jordan, 1999), in which the memory of the
time series is extended to a window of k time steps, or Hidden Markov Models
(MacDonald & Zucchini, 1997), in which hidden variables H are introduced to 
decompose the complex auto-regressive structure of the time series into smaller 
pieces. Hidden Markov Models were originally introduced in speech recognition 
(Rabiner, 1989) and are nowadays applied in many fields ranging from DNA 
and protein sequencing (Lio & Goldman, 1998) to robotics (Firoiu & Cohen, 
1999). Despite their popularity, Hidden Markov Models make assumptions that 
are questionable on the basis of recent results about the identifiability of hidden 
variables (Settimi & Smith, 1998) because identifiability of the hidden variables 
may impose strong constraints on the auto-regressive structure of the series. 
At first glance it may appear that BCD and Hidden Markov Models are similar 
technologies, and indeed we have used Hidden Markov Models for some robot 
learning tasks (Firoiu, Oates, & Cohen, 1998), but they are quite different. BCD 
approximates the process generating time series with MCs and clusters similar 
MCs by creating a variable C which represents cluster membership. Conditional
on each state of the variable C, the model for the time series is a MC with
transition probability $p(X_t = j | X_{t-1} = i, C_k)$. Graphically, this assumption is
represented by the model in Figure 6. The oval represents a MC and C separates
different ovals, so that, conditional on $C = C_k$ we have $p(X_t = j | X_{t-1} = i, C_k)$
and this is independent of other MCs.

In a Hidden Markov Model, the hidden variable H (or variables) allows one to
compute the transition probabilities among states as $p(x_t | x_{t-1}, H_k) = p(x_t | H_k)$,
so that conditioning on the state of the hidden variable makes the dependence
of $X_t$ on $X_{t-1}$ vanish. Figure 7 represents graphically this assumption. A detailed
explanation of the difference between a Hidden Markov Model and the model
used by BCD can be found in (Smyth, 1997) and the description of an algorithm
to cluster Hidden Markov Models is reported by Smyth (Smyth, 1999).

Fig. 7. Graphical representation of a Hidden Markov Model.

The Bayesian modeling-based approach used in BCD is similar in nature to
the Bayesian clustering methods developed by Raftery (Banfield & Raftery, 1993;
Fraley & Raftery, 1998) and Cheeseman (Cheeseman & Stutz, 1996) for static
data. Recent work (Ridgeway, 1998; Smyth, 1999) attempted to extend the idea
to dynamic processes without, however, succeeding in finding a closed form
solution, such as the one we have identified. Ridgeway (Ridgeway, 1998) proposed to
approximate the marginal likelihood using Markov Chain Monte Carlo methods.
Smyth (Smyth, 1999) represents a clustering of time series as a mixture of a 
known number of components and estimates the parameters of the mixture via 
the EM algorithm. Thus, in his approach, clustering reduces to an estimation 
problem, while BCD seeks the best clustering as solution of a model selection 
problem. Furthermore, all these methods assume the number of clusters to be 
known, while BCD searches for the number of clusters with the highest posterior 
probability, and relies on a heuristic method to efficiently explore the space of
possible partitions. 

Methodology aside, BCD is similar in some respects to some other algorithms
for clustering time series. To assess the dissimilarity of a pair of multivariate,
real-valued time series, Oates (Oates et al., 1999) applies dynamic time warping
to force a series to fit the other as well as possible; the residual lack of fit is 
a measure of dissimilarity, which is then used to cluster episodes. Rosenstein 
(Rosenstein & Cohen, 1998) solves the problem by first detecting events in time 
series, then measuring the root mean squared difference between values in two 
series in a window around an event. In Rosenstein’s method, two time series are 
compared moment by moment for a fixed interval. In Oates’s approach, a series 
is stretched and compressed within intervals to make it fit the other as well as 
possible. The former method keeps time rigid, the latter makes time elastic. If 
the duration of a sequence within a series is important to the identity of the 
series — if clustering should respect durations — the former method is probably 
preferable to the latter. BCD is even more extreme because it transforms a time 
series, in which durations might be important, into a table of state transitions, 
which is inherently atemporal. For instance, one cannot tell by looking at a
transition table whether a transition $X_t = j | X_{t-1} = i$ occurred before or after a
transition $X_{t'} = j | X_{t'-1} = i$.

The problem of finding dependencies between states in time series has been
studied by several authors (Howe, 1992; Oates & Cohen, 1996; Friedman,
Murphy, & Russell, 1998). The current work is unique in its approach to clustering
time series by the dependency structure that holds among states, and in
particular, its ability to differentiate time series that have different dependency
structures.

6 Conclusions 

This chapter presented BCD, a model-based Bayesian algorithm to cluster time
series generated by the same process. Although in this chapter we applied BCD
to cluster time series of a robot’s sensory values, the algorithm has been applied
successfully to identify prototypical dynamics in simulated war games, music 
composition and tracking of financial indexes. Our conjecture for this success 
is that, in those applications, approximating the true generating process by a 
first order MC was enough to capture the essential dynamics thus providing the 
algorithm with the information needed to partition the original set of time series 
into groups. However, approximating a dynamic process by a MC has limitations:
for example, the order in which transitions are observed in a time series is lost
and this may be a serious loss of information. The results given in Section 3
can be used to assess the adequacy of the Markov assumption, by using clusters
found by the algorithm to solve some classification or prediction task.



1 Introduction

Given a source of time series data, such as the stock market or the monitors 
in an intensive care unit, there is often utility in determining whether there are 
qualitatively different regimes in the data and in characterizing those regimes. 
For example, one might like to know whether the various indicators of a patient’s 
health measured over time are being produced by a patient who is likely to live 
or one that is likely to die. In this case, there is a priori knowledge of the number 
of regimes that exist in the data (two), and the regime to which any given time 
series belongs can be determined post hoc (by simply noting whether the patient 
lived or died). However, these two pieces of information are not always present. 

Consider a system that produces multivariate, real-valued time series by
selecting one of K hidden Markov models (HMMs), generating data according to
that model, and periodically transitioning to a different HMM. Only the time 
series produced by the system are observable. In particular, K and the identity 
of the HMM generating any given time series are not observable. Given a set of 
time series produced by such a system, this paper presents a method for auto- 
matically determining K, the number of generating HMMs, and for learning the 
parameters of those HMMs. 

HMMs are an attractive framework for modeling time series for two reasons. 
First, there are simple and efficient algorithms for inducing HMMs from time 
series data. Second, they have been demonstrated empirically to be capable of 
modeling the structure of the generative processes underlying a wide variety of 
real-world time series. HMMs have met with success in domains such as speech 
recognition (Jelinek 1997), computational molecular biology (Baldi et al. 1994), 
and gesture recognition (Bregler 1997). 

Given a set of time series, it is possible to induce an HMM that models 
these data. However, algorithms for inducing HMMs do not directly address the 
problem of identifying qualitatively different regimes in the data; they simply 
attempt to fit a single model that accounts for all of the data as well as possible, 
regardless of whether the data were generated by a single underlying process or 
multiple processes. 

Our approach to addressing the above limitation of HMMs involves
obtaining an initial estimate of K (the number of regimes in the time series) by
unsupervised clustering of the time series using dynamic time warping (DTW)
as a similarity metric. In addition to producing an estimate of K, this process
yields an initial partitioning of the data. As later sections will explain, DTW is 
related to HMM training algorithms but is weaker in several respects. Therefore, 
the clusters based on DTW are likely to contain mistakes. These initial clusters 
serve as input to a process that trains one HMM on each cluster and iteratively 
moves time series between clusters based on their likelihoods given the various 
HMMs. Ultimately the process converges to a final clustering of the data and a 
generative model (the HMMs) for each of the clusters. 

The remainder of the paper is organized as follows. Section 2 reviews HMMs 
and the algorithms used to induce HMMs from data. One such algorithm, the 
Viterbi algorithm, is explored in some detail. Section 3 describes dynamic time 
warping and its use as a distance measure between multivariate time series for 
the purpose of unsupervised clustering. Section 4 shows that despite surface 
dissimilarities between DTW and the Viterbi algorithm, under certain restricted 
conditions both algorithms optimize the same criterion. Section 5 describes an 
algorithm that takes advantage of the tight relationship between DTW and the 
Viterbi algorithm by using DTW to bootstrap the process of fitting HMMs to 
data containing multiple regimes. Section 6 presents the results of experiments 
with this approach using artificial data, and section 7 concludes and points to 
future work. 



2 Hidden Markov Models 

Sources of time series data abound. For example, the world’s financial markets 
produce volumes of such data on a daily, hourly, minute-by-minute and even 
second-by-second basis. Investors attempt to understand the processes underly- 
ing these markets and to make predictions about the directions in which they 
will move by analyzing time series of the prices of individual stocks, market 
averages, interest rates, exchange rates, etc. 

Of fundamental importance to the endeavor of time series analysis - re- 
gardless of whether the series are generated by financial markets, monitors in a 
nuclear power plant, or the activities of customers at a web site - is the construc- 
tion of models from data. In this context, a model is a concise representation 
of the data in a time series that ideally lays bare the structure in the data 
and, therefore, provides information about the real-world process underlying its 
generation. 

One type of model that has proven to be useful when working with time series
data from many different kinds of sources is the hidden Markov model (HMM). It
is possible to model time series containing either discrete values or real values 
within the HMM framework. In the former case a discrete HMM is used, and in 
the latter case a continuous HMM is used. 

We begin our discussion of HMMs with a formal specification of the elements
of discrete HMMs, following the presentation in (Rabiner 1989) (an excellent
tutorial to which the interested reader is referred), and then generalize to the
continuous case. This will include a brief discussion of various algorithms for,
among other things, inducing HMMs from data. Following this background
material, one of these algorithms (the Viterbi algorithm) will be explored in more
detail in preparation for a comparison with DTW in section 4.



2.1 HMM Fundamentals 

The following items comprise a discrete HMM: 

— a set of N states, $\{1, \ldots, N\}$

— an alphabet of M output symbols, $\Sigma = \{\sigma_1, \ldots, \sigma_M\}$

— a set of output probabilities, $B = \{b_{i,j} \mid 1 \le i \le N,\ 1 \le j \le M\}$

— a set of state transition probabilities, $A = \{a_{i,j} \mid 1 \le i \le N,\ 1 \le j \le N\}$

— an initial state probability distribution, $\pi = \{p_1, \ldots, p_N\}$

A time series of values is generated by a discrete HMM as follows. At every time 
step the HMM is in one of its N states. The state in which the HMM begins 
at the first time step is chosen according to the initial state distribution $\pi$. An
element of $\Sigma$ is then selected according to the output distribution of the current
state and that symbol is emitted. Next, the HMM transitions to a new state 
based on the transition distribution of the current state. Finally, this process of 
selecting an output symbol and transitioning to a new state repeats. 



N = 2, M = 3, $\Sigma = \{X, Y, Z\}$, $\pi = \{0.9, 0.1\}$

Fig. 1. An example of a discrete Hidden Markov model.



To make these ideas and the related notation clear, consider the discrete
HMM shown in figure 1. This HMM has N = 2 states. The initial state
distribution, $\pi = \{0.9, 0.1\}$, indicates that the HMM will begin in state 1 with
probability 0.9 and in state 2 with probability 0.1. Note that $\sum_{i=1}^{N} p_i = 1$.

At each time step, one of the M = 3 symbols in $\Sigma = \{X, Y, Z\}$ will be chosen
and emitted. The probability with which a given element of $\Sigma$ will be chosen
depends on the set of output probabilities:

$$B = \{b_{i,j} \mid 1 \le i \le N,\ 1 \le j \le M\}$$

The value of $b_{i,j}$ is the probability that the j-th symbol in $\Sigma$ (i.e. $\sigma_j$) will be
emitted in state i. Note that the output probabilities for each state must sum to
one, i.e. $\sum_{j=1}^{M} b_{i,j} = 1$. For the HMM in the figure, $b_{1,1}$ is the probability that
$\sigma_1$, which is X, will be emitted in state 1. This probability is 0.8. Likewise, the
probability of emitting $\sigma_2 = Y$ in state 1 is $b_{1,2} = 0.15$, and the probability of
emitting $\sigma_3 = Z$ in state 2 is $b_{2,3} = 0.34$. Clearly, time series generated by the
HMM in figure 1 will be elements of $\Sigma^*$, i.e. the Kleene closure over $\Sigma$.

Once an output symbol is chosen and emitted, the HMM makes a transition 
to a new state selected according to the set of state transition probabilities: 

$$A = \{a_{i,j} \mid 1 \le i \le N,\ 1 \le j \le N\}$$

The value of $a_{i,j}$ is the probability of transitioning to state j given that the
current state is i. Note that the transition probabilities for each state
must sum to 1, i.e. $\sum_{j=1}^{N} a_{i,j} = 1$. For example, the probability of transitioning
from state 1 to state 2 is $a_{1,2} = 0.25$, whereas the probability of transitioning
from state 2 to itself is $a_{2,2} = 0.5$.
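
The generative process described above can be sketched in Python as follows. This is a minimal illustration, not code from the chapter: the function name `sample_discrete_hmm` is hypothetical, the parameter values are those quoted in this chapter for the HMM of figure 1, and the two entries that are never quoted (the self-transition of state 1 and the probability of emitting Y in state 2) are inferred here from the requirement that each row sums to one.

```python
import random

SYMBOLS = ["X", "Y", "Z"]
PI = [0.9, 0.1]                      # initial state distribution
A = [[0.75, 0.25],                   # a_{1,1} inferred as 1 - 0.25
     [0.50, 0.50]]
B = [[0.80, 0.15, 0.05],             # state 1 emits X, Y, Z
     [0.33, 0.33, 0.34]]             # state 2; b_{2,2} inferred as 1 - 0.33 - 0.34


def sample_discrete_hmm(length, rng):
    """Generate `length` (state, symbol) pairs: emit a symbol, then change state."""
    state = rng.choices(range(len(PI)), weights=PI)[0]
    states, outputs = [], []
    for _ in range(length):
        states.append(state + 1)                                   # 1-based, as in the text
        outputs.append(rng.choices(SYMBOLS, weights=B[state])[0])  # emit a symbol
        state = rng.choices(range(len(A)), weights=A[state])[0]    # move to the next state
    return states, outputs


states, outputs = sample_discrete_hmm(10, random.Random(0))
print(states)
print(outputs)
```

Only `outputs` would be visible to an observer; `states` is returned here purely so the hidden path can be inspected.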



N = 2, $\pi = \{0.9, 0.1\}$

Fig. 2. An example of a continuous Hidden Markov model.



A continuous HMM differs from a discrete one in that the output symbols
are emitted from a probability density instead of a distribution. The most
commonly used density is the normal density with mean $\mu$ and standard deviation
$\sigma$. Consider the continuous HMM shown in figure 2. It is essentially the HMM
shown in figure 1, except there is no discrete alphabet $\Sigma$, and the set of output
probabilities, B, has been replaced with two normal densities, one for each state.
When the HMM is in state 1, a real value chosen from the normal density with
mean $\mu_1 = 2$ and standard deviation $\sigma_1 = 1$ will be emitted. When the HMM is
in state 2, a real value chosen from the normal density with mean $\mu_2 = 8$ and
standard deviation $\sigma_2 = 3$ will be emitted. A time series containing 100 time
steps generated by this HMM is shown in figure 3.




Fig. 3. A time series generated by the HMM in figure 2. 



Given a time series generated by an HMM, the activity of the HMM is obser- 
ved indirectly through the sequence of outputs. Therefore, the states are said to 
be hidden. Hidden Markov models are an extension of standard Markov models 
in which the output at each state is simply the index of the state, so an output 
sequence identifies the state sequence that produced it. It is both possible and 
common to extend the definition of HMMs to allow for the emission of a vector 
of discrete or real values at each time step rather than just a single value as 
described above. 

Let $\lambda$ be an HMM (either discrete or continuous), and let $O = \{o_1, \ldots, o_L\}$
be a sequence of observations (i.e. a time series) gathered over L time steps.
There exist dynamic programming algorithms to solve the following problems:

1 determine the probability of observing O given that it was generated by $\lambda$,
i.e. $p(O|\lambda)$

2 find the sequence of states through $\lambda$ that maximizes the probability of O
(the Viterbi algorithm)

3 determine the parameters of $\lambda$ (e.g. the state transition and emission
probabilities) that maximize the probability of O (the Baum-Welch algorithm)
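
As an illustration of the first problem, the following Python sketch computes $p(O|\lambda)$ for a discrete HMM with the standard forward recursion. The text does not name the algorithm it has in mind for problem 1, so the choice of the forward algorithm is an assumption, as is the function name `forward_probability`. The parameters are those quoted in this chapter for figure 1, with the two unquoted entries inferred from rows summing to one.

```python
PI = [0.9, 0.1]
A = [[0.75, 0.25],                   # a_{1,1} inferred from the row summing to one
     [0.50, 0.50]]
B = [[0.80, 0.15, 0.05],             # columns are the symbols X, Y, Z
     [0.33, 0.33, 0.34]]             # b_{2,2} inferred from the row summing to one


def forward_probability(obs, pi, A, B):
    """Return p(O | lambda), summing over all state sequences by dynamic programming."""
    n = len(pi)
    # alpha[j] = probability of the observations so far and of being in state j now
    alpha = [pi[j] * B[j][obs[0]] for j in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(alpha)


X, Y, Z = 0, 1, 2
p_obs = forward_probability([X, X, Z], PI, A, B)
print(p_obs)
```

The recursion costs on the order of $N^2 L$ operations, whereas enumerating every state sequence explicitly would cost on the order of $N^L$.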

The solution to the first problem allows us to evaluate how well a given model
matches a given observation sequence, and therefore to choose among multiple
competing models the one that best “fits” the data. The solution to the second
problem makes it possible to uncover that which is hidden, the sequence of states
that is most likely given the data. If it is assumed that the data were generated
by an HMM, the solution to problem 3 makes it possible to find the parameters
of the HMM that maximize the probability of the data. An important caveat is
that this maximization is not done globally with respect to all possible parameter
values. Rather, the Baum-Welch algorithm finds a maximum in a local
neighborhood of its initial parameter values.

2.2 The Viterbi Algorithm 

Let $O = \{o_1, \ldots, o_L\}$ be an observation sequence containing L values, and let
$S = \{s_1, \ldots, s_L\}$ be a sequence of L states through HMM $\lambda$. Let $p(O|S, \lambda)$ be
the probability of observing O given state sequence S. The Viterbi algorithm
uses dynamic programming to find a state sequence out of the $N^L$ possible
state sequences of length L that maximizes $p(O|S)$. (We will drop the $\lambda$ when
its identity is either clear from the context or unimportant.) Note that there
may be more than one state sequence that maximizes the probability of the
observation sequence.

Given a state sequence and an observation sequence, how does one compute
$p(O|S)$? This is best explained via an example. Let $\lambda$ be the discrete HMM
shown in figure 1. Let $O = \{X, X, Z\}$ and let $S = \{1, 2, 1\}$. For O to occur given
S, the HMM must begin in state 1, which occurs with probability $\pi_1 = 0.9$,
and emit an X, which occurs with probability $b_{1,1} = 0.8$. The HMM must
then transition to state 2 (with probability $a_{1,2} = 0.25$) and emit an X (with
probability $b_{2,1} = 0.33$). Finally, the HMM must transition back from state 2 to
state 1 (with probability $a_{2,1} = 0.5$) and emit a Z (with probability $b_{1,3} = 0.05$).
Because all of these various events are independent, the final probability is the
product of the individual probabilities:

$$\begin{aligned}
p(\{X, X, Z\} | \{1, 2, 1\}) &= p(\{o_1, o_2, o_3\} | \{s_1, s_2, s_3\}) \\
&= \pi_{s_1} b_{o_1,s_1} a_{s_1,s_2} b_{o_2,s_2} a_{s_2,s_3} b_{o_3,s_3} \\
&= 0.9 \cdot 0.8 \cdot 0.25 \cdot 0.33 \cdot 0.5 \cdot 0.05 = 0.001485 \qquad (1)
\end{aligned}$$

For arbitrary observation and state sequences of length L, equation 1 generalizes
as follows:

$$p(O|S) = \pi_{s_1} b_{o_1,s_1} \prod_{t=2}^{L} a_{s_{t-1},s_t}\, b_{o_t,s_t} \qquad (2)$$
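
Equation 2 is easy to evaluate directly, and doing so for the worked example is a useful check on the arithmetic. The sketch below is illustrative only; `path_probability` is a hypothetical name, indices are 0-based, and the two parameter entries not quoted in the text are inferred from rows summing to one (they are not used by this particular example).

```python
# Parameters of the discrete HMM in figure 1, as quoted in the text.
PI = [0.9, 0.1]
A = [[0.75, 0.25],                   # a_{1,1} inferred from the row summing to one
     [0.50, 0.50]]
B = [[0.80, 0.15, 0.05],             # state 1: X, Y, Z
     [0.33, 0.33, 0.34]]             # state 2: X, Y, Z (b_{2,2} inferred)


def path_probability(obs, states):
    """Evaluate equation 2: the initial term times one transition and one emission per step."""
    p = PI[states[0]] * B[states[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= A[states[t - 1]][states[t]] * B[states[t]][obs[t]]
    return p


X, Y, Z = 0, 1, 2
p = path_probability([X, X, Z], [0, 1, 0])   # O = {X, X, Z}, S = {1, 2, 1}
print(p)                                     # approximately 0.001485
```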

We can simplify the notation of equation 2 by defining a special state 0 such
that the probability of transition from state 0 to state i is $\pi_i$ and the probability
of transitioning into state 0 is 0 (i.e. $a_{0,i} = \pi_i$ and $a_{i,0} = 0$). If the HMM starts
in state 0 with probability 1, then state 0 serves the function of the initial state
probability distribution. We have not changed the output behavior of the HMM,
but equation 2 can now be rewritten as follows:

$$p(O|S) = \prod_{t=1}^{L} a_{s_{t-1},s_t}\, b_{o_t,s_t} = \prod_{t=1}^{L} p(s_t|s_{t-1})\, p(o_t|s_t), \quad s_0 = 0 \qquad (3)$$

For the sake of clarity in the remaining discussion, $a_{s_{t-1},s_t}$ has been replaced
with $p(s_t|s_{t-1})$ and $b_{o_t,s_t}$ has been replaced with $p(o_t|s_t)$. The Viterbi algorithm
returns a state sequence S that maximizes equation 3.
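
A compact Python sketch of the Viterbi recursion is given below. It is a textbook-style formulation rather than the authors' code: the function name `viterbi` is hypothetical, the recursion works with log probabilities for numerical convenience, and the HMM is the one from figure 1 with the two unquoted entries inferred from rows summing to one.

```python
import math

PI = [0.9, 0.1]
A = [[0.75, 0.25],                   # a_{1,1} inferred from the row summing to one
     [0.50, 0.50]]
B = [[0.80, 0.15, 0.05],             # columns are the symbols X, Y, Z
     [0.33, 0.33, 0.34]]             # b_{2,2} inferred from the row summing to one


def safe_log(x):
    """Natural log that maps probability 0 to -infinity instead of raising."""
    return math.log(x) if x > 0 else float("-inf")


def viterbi(obs, pi, A, B):
    """Return (best 0-based state path, its log probability) for a discrete HMM."""
    n = len(pi)
    score = [safe_log(pi[j]) + safe_log(B[j][obs[0]]) for j in range(n)]
    back = []                                    # back[t][j] = best predecessor of j
    for o in obs[1:]:
        prev, pointers, score = score, [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: prev[i] + safe_log(A[i][j]))
            pointers.append(best_i)
            score.append(prev[best_i] + safe_log(A[best_i][j]) + safe_log(B[j][o]))
        back.append(pointers)
    last = max(range(n), key=lambda j: score[j])
    path = [last]
    for pointers in reversed(back):              # follow the back-pointers to the start
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]


X, Y, Z = 0, 1, 2
best_path, best_logp = viterbi([X, X, Z], PI, A, B)
print([s + 1 for s in best_path], math.exp(best_logp))
```

For the observation sequence {X, X, Z} the best path found by this sketch is {1, 1, 2}, which has a much higher probability than the path {1, 2, 1} used in the worked example above.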

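It will prove to be useful later to note that any state sequence that maximizes
equation 3 also maximizes $\ln(p(O|S))$:

$$\begin{aligned}
\ln(p(O|S)) &= \ln\left(\prod_{t=1}^{L} p(s_t|s_{t-1})\, p(o_t|s_t)\right) \\
&= \sum_{t=1}^{L} \ln(p(s_t|s_{t-1})) + \sum_{t=1}^{L} \ln(p(o_t|s_t)) \qquad (4)
\end{aligned}$$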

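For continuous HMMs in which the output in state i is drawn from a normal
distribution with mean $\mu_i$ and standard deviation $\sigma_i$, equation 4 simplifies as
follows, where $\mu_{s_t}$ and $\sigma_{s_t}$ are the parameters of the state occupied at time t:

$$\begin{aligned}
\ln(p(O|S)) &= \sum_{t=1}^{L} \ln(p(s_t|s_{t-1})) + \sum_{t=1}^{L} \ln(p(o_t|s_t)) \\
&= \sum_{t=1}^{L} \ln(p(s_t|s_{t-1})) + \sum_{t=1}^{L} \ln\left(\frac{1}{\sqrt{2\pi}\,\sigma_{s_t}}\, e^{-\frac{(o_t - \mu_{s_t})^2}{2\sigma_{s_t}^2}}\right) \\
&= \sum_{t=1}^{L} \ln(p(s_t|s_{t-1})) + \sum_{t=1}^{L} \left[\ln(1) - \ln(\sqrt{2\pi}\,\sigma_{s_t}) + \ln\left(e^{-\frac{(o_t - \mu_{s_t})^2}{2\sigma_{s_t}^2}}\right)\right] \\
&= \sum_{t=1}^{L} \ln(p(s_t|s_{t-1})) - L\ln(\sqrt{2\pi}) - \sum_{t=1}^{L} \ln(\sigma_{s_t}) - \sum_{t=1}^{L} \frac{(o_t - \mu_{s_t})^2}{2\sigma_{s_t}^2} \qquad (5)
\end{aligned}$$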



We will return to equation 5 in section 4 where we formally specify the relati- 
onship between the Viterbi algorithm and DTW. 
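
The simplification behind equation 5 can be checked numerically. The Python sketch below compares the sum of log normal densities along a state path with its expanded form, namely minus L times ln(sqrt(2 pi)), minus the sum of the ln(sigma) terms, minus the sum of squared deviations scaled by twice the variance. The means and standard deviations are those given for figure 2; the observations, the state path, and both function names are hypothetical.

```python
import math

MU = [2.0, 8.0]        # means of states 1 and 2 (figure 2)
SIGMA = [1.0, 3.0]     # standard deviations of states 1 and 2 (figure 2)


def emission_log_density(obs, states):
    """Sum over t of ln p(o_t | s_t), evaluating each normal density explicitly."""
    total = 0.0
    for o, s in zip(obs, states):
        density = (math.exp(-(o - MU[s]) ** 2 / (2 * SIGMA[s] ** 2))
                   / (math.sqrt(2 * math.pi) * SIGMA[s]))
        total += math.log(density)
    return total


def emission_closed_form(obs, states):
    """The same quantity written as the expanded emission terms."""
    L = len(obs)
    return (-L * math.log(math.sqrt(2 * math.pi))
            - sum(math.log(SIGMA[s]) for s in states)
            - sum((o - MU[s]) ** 2 / (2 * SIGMA[s] ** 2) for o, s in zip(obs, states)))


obs = [1.5, 2.5, 9.0]  # hypothetical observations
states = [0, 0, 1]     # hypothetical 0-based state path
direct = emission_log_density(obs, states)
expanded = emission_closed_form(obs, states)
print(direct, expanded)
```

The transition terms are left out because they are unchanged by the simplification; only the emission terms are rewritten.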



3 Dynamic Time Warping 

This section describes dynamic time warping and its use as a measure of simi- 
larity for unsupervised clustering of time series. Recall that our ultimate goal is 
to discover qualitatively different regimes in time series data and to identify the 
individual HMMs that underlie them. Given a suitable measure of similarity, we
can discover qualitatively different regimes by clustering the time series. Each
cluster should then correspond to a regime. DTW is such a measure of
similarity. We begin with an intuitive explanation of DTW and how
it is used in unsupervised clustering of time series, and then describe the
algorithm more formally in preparation for a comparison of DTW and the Viterbi
algorithm in section 4.







T. Oates, L. Firoiu, and P.R. Cohen 



3.1 The Intuition Behind DTW and Clustering 

Using the notation of section 2, let O denote a multivariate time series spanning
L time steps such that $O = \{o_1, \ldots, o_L\}$. The $o_t$ are vectors of values containing
one element for the value of each of the component univariate time series at
time t. Given a set of m multivariate time series, we want to obtain, in an
unsupervised manner, a partition of these time series into subsets such that 
each subset corresponds to a qualitatively different regime. 

If an appropriate measure of the similarity of two time series is available, 
clustering followed by prototype extraction is a suitable unsupervised learning 
method for this problem. Finding such a measure of similarity is difficult because 
time series that are qualitatively the same may be quantitatively different in at 
least two ways. First, they may be of different lengths, making it difficult or 
impossible to embed the time series in a metric space and use, for example, 
Euclidean distance to determine similarity. Second, within a single time series, 
the rate at which progress is made can vary non-linearly. The same pattern may 
evolve slowly at first and then speed up, or it may begin quickly and then slow 
down. Such differences in rate make similarity measures such as cross-correlation 
unusable. 

DTW is a generalization of classical algorithms for comparing discrete
sequences (e.g. minimum string edit distance (Cormen, Leiserson, & Rivest 1990))
to sequences of continuous values (Sankoff & Kruskall 1983). It was used
extensively in speech recognition, a domain in which the time series are notoriously
complex and noisy, until the advent of Hidden Markov Models which offered a 
unified probabilistic framework for the entire recognition process (Jelinek 1997). 




Fig. 4. Two time series, $O_1$ and $O_2$, (the leftmost column) and two possible warpings
of $O_1$ into $O_2$ (the middle and rightmost columns).



Given two time series, $O_1$ and $O_2$, DTW finds the warping of the time
dimension in $O_1$ that minimizes the difference between the two series. Consider
the two univariate time series shown in Figure 4. Imagine that the time axis of
$O_1$ is an elastic string, and that you can grab that string at any point
corresponding to a time at which a value was recorded for the time series. Warping
of the time dimension consists of grabbing one of those points and moving it to
a new location. As the point moves, the elastic string (the time dimension)
compresses in the direction of motion and expands in the other direction. Consider
the middle column in Figure 4. Moving the point at the third time step from its
original location to the seventh time step causes all of the points to its right to
compress into the remaining available space, and all of the points to its left to
fill the newly created space. Of course, much more complicated warpings of the
time dimension are possible, as with the third column in Figure 4 in which four
points are moved.

Given a warping of the time dimension in $O_1$, yielding a time series that we
will denote $O_1'$, one can compare the similarity of $O_1'$ and $O_2$ by determining,
for example, the area between the two curves. That area is shown in gray in the
bottom row of Figure 4. Note that the first warping of $O_1$, in which a single point
was moved, results in a poor match, one with a large area between the curves.
However, the fit given by the second, more complex warping is quite good. In
general, there are exponentially many ways to warp the time dimension of $O_1$.
DTW uses dynamic programming to find the warping that minimizes the area
between the curves in time that is a low-order polynomial of the lengths of $O_1$
and $O_2$, i.e. $O(|O_1||O_2|)$.
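The dynamic-programming idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: it handles univariate series only, and it scores an alignment by the sum of squared differences between aligned points (the measure discussed in section 3.2) rather than by the area between the curves. The name `dtw_distance` is our own.

```python
def dtw_distance(a, b):
    """Minimal cumulative squared difference over all warpings of a onto b."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] is the best cost of aligning the first i points of a
    # with the first j points of b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            # Extend the cheapest of the three admissible predecessor alignments.
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    return cost[n][m]


slow = [0, 0, 1, 2, 3, 3, 3]   # the same ramp, lingering at both ends
fast = [0, 1, 2, 3]
print(dtw_distance(slow, fast))          # 0.0
print(dtw_distance(slow, [0, 1, 2, 4]))  # larger: the final values differ
```

The two nested loops fill one table cell per pair of time steps, which is where the quadratic running time comes from. The two example series have different lengths and progress at different rates, yet their distance is zero.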

DTW returns the optimal warping of $O_1$, the one that minimizes the area
between $O_1'$ and $O_2$, and the area associated with that warping. The area is used
as a measure of similarity between the two time series. Note that this measure of
similarity handles nonlinearities in the rates at which time series progress and is
not affected by differences in the lengths of the time series. In general, the area
obtained by warping $O_1$ into $O_2$ may not be the same as the area obtained by
warping $O_2$ into $O_1$. However, a symmetrized version of DTW exists that essentially
computes the average of those two areas based on a single warping (Kruskall & Liberman 1983).
Although a straightforward implementation of DTW is more expensive than 
computing Euclidean distance or cross-correlation, there are numerous speedups 
that both improve the properties of DTW as a distance metric and make its 
computation nearly linear in the length of the time series with a small constant. 

Given $m$ time series, we can construct a complete pairwise distance matrix
by invoking DTW $O(m^2)$ times. We then apply a standard hierarchical,
agglomerative clustering algorithm that starts with one cluster for each time series
and merges the pair of clusters with the minimum average intercluster distance 
(Everitt 1993). Without a stopping criterion, merging will continue until there 
is a single cluster containing all m time series. To avoid that situation, we do not 
merge clusters for which the mean intercluster distance is significantly different 
from the mean intracluster distance as measured by a t-test. The number of 
clusters remaining when this process terminates is K, the number of regimes in 
the time series. 
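The merging loop can be sketched as follows. This is a simplified illustration, not the procedure used in the experiments: the t-test stopping criterion is replaced by a plain distance threshold (`stop_threshold`, an assumption of this sketch), and the distance matrix is taken as given rather than computed with DTW. The function name `average_linkage` is our own.

```python
def average_linkage(dist, stop_threshold):
    """Agglomerative clustering on a precomputed pairwise distance matrix."""
    clusters = [[i] for i in range(len(dist))]  # one cluster per time series

    def avg_dist(c1, c2):
        # Average intercluster distance over all cross-cluster pairs.
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = avg_dist(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        if d > stop_threshold:
            break  # stand-in for the t-test: the closest pair is still too far apart
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


dist = [
    [0, 1, 9, 8],
    [1, 0, 8, 9],
    [9, 8, 0, 2],
    [8, 9, 2, 0],
]
print(average_linkage(dist, stop_threshold=4.0))  # [[0, 1], [2, 3]]
```

With a very large threshold the loop never stops early and all series end up in a single cluster, which is the situation the stopping criterion is meant to avoid.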

Finally, for each cluster we select a prototype. Two methods commonly used 
are to choose the cluster member that minimizes the distance to all other
members of the cluster, and to simply average the members of the cluster. The
advantage of the latter method is that it smooths out noise that may be present
in any individual data item. Unfortunately, it is only workable when the cluster
elements are embedded in a metric space (e.g. Cartesian space). Although we
cannot embed time series in a metric space, DTW allows us to use a combination
of the two methods as follows. First, we select the time series that minimizes
distance to all other time series in a given cluster. Then we warp all other patterns
into that centroid, resulting in a set of patterns that are all on the same time 
scale. It is then a simple matter to take the average value at each time point 
over all of the series and use the result as the cluster prototype. 
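A minimal sketch of this two-step prototype construction is given below. It rests on several assumptions that are ours rather than the source's: series are univariate, the alignment cost is the sum of squared differences, and when several points of a series are aligned with the same time step of the centroid they all enter the average for that step. The names `dtw_path` and `prototype` are hypothetical.

```python
def dtw_path(a, b):
    """Return (cost, path); path pairs each index of a with an index of b."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    # Trace back from the final cell to recover the optimal alignment.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(cost[i - 1][j - 1], i - 1, j - 1),
                 (cost[i - 1][j], i - 1, j),
                 (cost[i][j - 1], i, j - 1)]
        _, i, j = min(steps)
    return cost[n][m], path[::-1]


def prototype(cluster):
    # Step 1: pick the member with the smallest total distance to the others.
    totals = [sum(dtw_path(s, other)[0] for other in cluster) for s in cluster]
    centroid = cluster[totals.index(min(totals))]
    # Step 2: warp every member onto the centroid's time scale and average.
    sums = [0.0] * len(centroid)
    counts = [0] * len(centroid)
    for s in cluster:
        _, path = dtw_path(s, centroid)
        for i, j in path:
            sums[j] += s[i]
            counts[j] += 1
    return [total / count for total, count in zip(sums, counts)]


cluster = [[0, 1, 2, 3], [0, 0, 1, 2, 3, 3], [0, 1, 1, 2, 3]]
print(prototype(cluster))  # [0.0, 1.0, 2.0, 3.0]
```

The three example series are warped versions of one ramp, so the prototype recovers that ramp on the time scale of the selected member. With noisy members the per-step averages would smooth the noise instead.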

3.2 A More Rigorous Discussion of DTW 

This section describes DTW more formally than the previous one. There are 
two reasons for this. The primary motivation is the development of ideas and 
notation that will make it possible in section 4 to explicitly state the similarities 
between the Viterbi algorithm and DTW. A secondary motivation is simply to 
present a more conventional account of DTW. The use of areas between warped 
time series as the measure to be minimized is somewhat non-standard, whereas 
the use of the sum of Euclidean distances between points (after warping) is rather 
common. The discussion in this section will focus on the latter measure. 

Let $W$ and $Y$ be multivariate, real-valued time series of length $L_W$ and $L_Y$,
respectively, such that:

$$W = w_1 w_2 \ldots w_{L_W}$$

$$Y = y_1 y_2 \ldots y_{L_Y}$$

Given $t_y \in \{1, 2, \ldots, L_Y\}$ and $t_w \in \{1, 2, \ldots, L_W\}$, let $\phi$ be a warping function
that maps elements of $W$ onto elements of $Y$ as follows:

$$\phi(t_w) = t_y$$

That is, for any given element in $W$ whose index is $t_w$, $\phi$ returns the index in $Y$
to which that element is mapped, i.e. $t_y$.

Let $d$ be a distance measure, such as Euclidean distance. Then DTW uses
dynamic programming to determine $\phi$ such that the following sum is minimized:

$$D_\phi = \sum_{t_w=1}^{L_W} d(w_{t_w}, y_{\phi(t_w)}) \qquad (6)$$

It is common for $d$ to compute the squared difference. When $W$ and $Y$ are
univariate, this yields:

$$D_\phi = \sum_{t_w=1}^{L_W} (w_{t_w} - y_{\phi(t_w)})^2 \qquad (7)$$

Note that this is rather different from using the area between $Y$ and the warped
version of $W$ as the measure that is to be minimized. For example, it is possible
that $\phi$ maps all points in $W$ to a single point in $Y$, in which case the warped
version of $W$ spans 0 time steps and $D_\phi$ is simply the sum of the distances
between all of the points in $W$ and the single point in $Y$ to which they are
mapped.

In practice, certain constraints are typically placed on $\phi$ to avoid such
situations. Those commonly reported in the DTW literature are:

— Boundary Conditions: This constraint forces the first element of $W$ to
map to the first element of $Y$ (i.e. $\phi(1) = 1$) and the last element of $W$ to
the last element of $Y$ (i.e. $\phi(L_W) = L_Y$).

— Monotonicity: This constraint ensures that the warping maintains the
temporal ordering of points in $W$ (though clearly not the temporal interval
between them), i.e. $\phi(t_i) \geq \phi(t_j)$ for $t_i > t_j$.

— Continuity: Under this rubric falls a large number of constraints that limit,
for example, the degree to which the time dimension can be warped.
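The following sketch evaluates equation 7 for an explicit warping function and checks the first two constraints. It uses 0-based indices, so the boundary conditions become `phi[0] == 0` and `phi[-1] == len(y) - 1`, and it reads monotonicity as "non-decreasing", which is an assumption of this sketch. The names `warp_cost` and `satisfies_constraints` are our own.

```python
def warp_cost(w, y, phi):
    """Equation 7: sum of squared differences under the mapping phi."""
    return sum((w[t] - y[phi[t]]) ** 2 for t in range(len(w)))


def satisfies_constraints(phi, len_y):
    boundary = phi[0] == 0 and phi[-1] == len_y - 1
    monotonic = all(phi[t] <= phi[t + 1] for t in range(len(phi) - 1))
    return boundary and monotonic


w = [1.0, 2.0, 3.0, 2.0]
y = [1.0, 3.0, 2.0]

degenerate = [2, 2, 2, 2]  # every point of W mapped to the last point of Y
sensible = [0, 1, 1, 2]

print(warp_cost(w, y, degenerate), satisfies_constraints(degenerate, len(y)))  # 2.0 False
print(warp_cost(w, y, sensible), satisfies_constraints(sensible, len(y)))      # 1.0 True
```

The degenerate mapping is a legal argument to equation 7 and yields a finite cost, but it violates the boundary conditions, which is how the constraints rule it out.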



4 A Unified View of HMMs and DTW 

Given two time series W and Y, dynamic time warping finds a mapping of points 
in W to points in Y such that the difference between the warped version of W and 
$Y$ is minimized. In contrast, given a time series $W$ and an HMM $\lambda$, the Viterbi
algorithm finds the state sequence $S$ through $\lambda$ for which $p(O|S)$ is maximized.
On the face of it, these two algorithms are quite different. DTW operates on two 
time series and the Viterbi algorithm operates on a time series and an HMM. 
DTW minimizes a measure of dissimilarity and the Viterbi algorithm maximizes 
a probability. 

Despite the apparent differences between DTW and the Viterbi algorithm, 
this section will show that under certain circumstances these two algorithms 
are in fact optimizing the same criterion. Given two time series W and Y , it is 
possible to construct an HMM based on Y such that the state sequence returned 
by the Viterbi algorithm (the one that maximizes $p(O|S)$) and the mapping
returned by DTW (the one that minimizes the dissimilarity of the time series) 
are equivalent. 

Using the notation of section 3.2, let $W$ and $Y$ be univariate, real-valued
time series of length $L_W$ and $L_Y$, respectively, such that:

$$W = w_1 w_2 \ldots w_{L_W}$$

$$Y = y_1 y_2 \ldots y_{L_Y}$$

(The results below generalize trivially to the multivariate case.) Let $\lambda_Y$ be the
continuous HMM constructed from $Y$ as follows:

1. Create one state in $\lambda_Y$ for each element of $Y$.

2. Let the output density of state $i$ be the normal density with mean $\mu_i = y_i$
and standard deviation $\sigma_i = \sigma$. Note that the mean of the normal density
in any given state depends on a value in $Y$, but the standard deviation is a
constant which is independent of the state and $Y$.

3. Set all transition probabilities to $a = 1/N$, where $N$ is the number of states.
That is, all states are equally likely as next states regardless of the current state.



Fig. 5. An HMM constructed from the time series $Y = \{y_1, y_2, y_3\}$. For each element
of $Y$ there is a state with a normal output density with mean $\mu_i = y_i$ and standard
deviation $\sigma_i = \sigma$. All transition probabilities are the same, i.e. $a = 1/N = 1/3$.



An example of a continuous HMM constructed in this manner for $Y = \{y_1, y_2, y_3\}$
is shown in Figure 5.

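Given $W$ and $\lambda_Y$, the Viterbi algorithm finds a state sequence that maximizes
$p(W|S, \lambda_Y)$ or, equivalently, $\ln(p(W|S, \lambda_Y))$ as given by equation 5. Because all
state transitions in $\lambda_Y$ have the value $a$ and all standard deviations have the
value $\sigma$, we can simplify equation 5 as follows:

$$
\begin{aligned}
\ln(p(W|S, \lambda_Y)) &= \sum_{t=1}^{L_W} \ln(p(s_t|s_{t-1})) - \sum_{t=1}^{L_W} \ln(\sqrt{2\pi}\,\sigma_{s_t}) - \frac{1}{2} \sum_{t=1}^{L_W} \left( \frac{w_t - \mu_{s_t}}{\sigma_{s_t}} \right)^2 \\
&= \sum_{t=1}^{L_W} \ln(a) - \sum_{t=1}^{L_W} \ln(\sqrt{2\pi}\,\sigma) - \frac{1}{2} \sum_{t=1}^{L_W} \left( \frac{w_t - y_{s_t}}{\sigma} \right)^2 \\
&= L_W \ln(a) - L_W \ln(\sqrt{2\pi}\,\sigma) - \frac{1}{2\sigma^2} \sum_{t=1}^{L_W} (w_t - y_{s_t})^2 \qquad (8)
\end{aligned}
$$

That is, given an HMM $\lambda_Y$ constructed from time series $Y$ as described above,
finding a state sequence $S$ that maximizes $p(W|S, \lambda_Y)$ is equivalent to finding a
state sequence that maximizes equation 8.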

The first two terms in equation 8 are constant, and thus can be ignored for
the purpose of maximization. Therefore, $\ln(p(W|S, \lambda_Y))$ is maximized when the
third term in that equation is maximized. Because $\sigma$ is a positive constant and
$(w_t - y_{s_t})^2$ is non-negative, we have:

$$\arg\max_S \left[ -\frac{1}{2\sigma^2} \sum_{t=1}^{L_W} (w_t - y_{s_t})^2 \right] = \arg\min_S \sum_{t=1}^{L_W} (w_t - y_{s_t})^2$$

Therefore, by finding a state sequence that minimizes the following expression,
the Viterbi algorithm maximizes equation 8:

$$\sum_{t=1}^{L_W} (w_t - y_{s_t})^2$$

Because states in $\lambda_Y$ are in a one-to-one correspondence with elements of
$Y$, the state sequence returned by the Viterbi algorithm is also a mapping of
elements in $W$ to elements of $Y$. That is, the state sequence specifies for each $w_t$
which state produced that value. Furthermore, as we have just demonstrated, 
this sequence is one that minimizes the sum of squared differences between the 
elements of W and the elements of Y onto which they are mapped. By equation 
7, this is exactly what DTW does: it finds a mapping of points in W to points 
in Y that minimizes the sum of squared differences. 
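The argument can be checked numerically on a small example. The sketch below builds the HMM described above (one state per element of $Y$, shared standard deviation, uniform transitions), runs a standard log-domain Viterbi recursion, and compares the result with the mapping that minimizes the sum of squared differences when no constraints are placed on the warping. Two assumptions are ours: the initial state distribution is taken to be uniform as well, and the names `viterbi_path` and `nearest_mapping` are hypothetical.

```python
import math


def viterbi_path(w, y, sigma=1.0):
    """Most likely state path for observations w under the HMM built from y."""
    n = len(y)
    log_a = math.log(1.0 / n)  # every transition has probability a = 1/N

    def log_emit(x, mean):
        # Log of the normal density with the given mean and standard deviation sigma.
        return -math.log(math.sqrt(2 * math.pi) * sigma) - (x - mean) ** 2 / (2 * sigma ** 2)

    # Assumption: the initial state distribution is uniform, like the transitions.
    score = [log_a + log_emit(w[0], y[i]) for i in range(n)]
    backpointers = []
    for x in w[1:]:
        new_score, pointers = [], []
        for i in range(n):
            prev = max(range(n), key=lambda k: score[k] + log_a)
            pointers.append(prev)
            new_score.append(score[prev] + log_a + log_emit(x, y[i]))
        score = new_score
        backpointers.append(pointers)
    # Follow the back-pointers from the best final state.
    state = max(range(n), key=lambda k: score[k])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


def nearest_mapping(w, y):
    """Unconstrained minimizer of equation 7: map each w_t to the closest y_i."""
    return [min(range(len(y)), key=lambda i: (x - y[i]) ** 2) for x in w]


w = [0.1, 0.9, 2.2, 2.9, 1.1]
y = [0.0, 1.0, 2.0, 3.0]
print(viterbi_path(w, y))     # [0, 1, 2, 3, 1]
print(nearest_mapping(w, y))  # [0, 1, 2, 3, 1]
```

Because no boundary, monotonicity, or continuity constraint is imposed here, the last point of `w` is free to map back to an earlier element of `y`. The equivalence shown is therefore with the unconstrained minimization of equation 7.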



5 Clustering with DTW and HMMs 

This section presents an algorithm for clustering time series using only HMMs, 
and then shows how the utility of that algorithm is greatly enhanced by the 
information obtained by first clustering the time series with DTW. 



5.1 Clustering with HMMs 



Definition of “acceptance” of an output sequence by an HMM. The
assumption underlying our method of clustering with HMMs is that all of the
sequences that belong in a cluster were generated by the same HMM and, as such,
have high probabilities under this HMM. If a sequence has a high probability
under a model, we consider it to be generated, or “accepted”, by the model.
If it has a low probability we consider it to be “rejected”. We apply a simple
statistical test to check whether an observed sequence $O$ is generated by a given
model $\lambda$. We generate a large sample of sequences from the model $\lambda$. From
this sample we calculate the empirical probability distribution (see “Computer
Intensive Methods” in (Cohen 1995)) of $\log(P(o|\lambda))$¹, the log-likelihood of the
sequences generated by the model. Let $L$ be the log-likelihood of $O$ under the
model, i.e. $L = \log(P(O|\lambda))$. We then test the hypothesis that $L$ is drawn from
the probability distribution of the log-likelihood of the sequences generated by
the model $\lambda$. Specifically, we test that:

$$P(\text{log-likelihood} < L) > \text{threshold}$$



and reject the null hypothesis if the probability is below the threshold. If the
hypothesis is rejected and $L$ is considered not to be drawn from the probability
distribution of the log-likelihood of the sequences generated by $\lambda$, then we infer
that the sequence $O$ is not accepted by the model $\lambda$.
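A minimal sketch of this acceptance test for a discrete HMM is shown below. It is an illustration under our own assumptions, not the code used in the experiments: the model is given directly by its initial, transition, and emission probabilities (`pi`, `A`, `B`), sampled sequences have the same length as the tested sequence, the average log-likelihood per symbol is used, and the sample size, threshold, and seed are arbitrary defaults.

```python
import math
import random


def sample_sequence(pi, A, B, length, rng):
    """Draw one output sequence of the given length from the HMM."""
    def draw(probs):
        return rng.choices(range(len(probs)), weights=probs)[0]

    state = draw(pi)
    seq = []
    for _ in range(length):
        seq.append(draw(B[state]))
        state = draw(A[state])
    return seq


def avg_log_likelihood(seq, pi, A, B):
    """Scaled forward algorithm; returns log P(seq | model) / len(seq)."""
    n = len(pi)
    alpha = [pi[i] * B[i][seq[0]] for i in range(n)]
    total = 0.0
    for t in range(len(seq)):
        if t > 0:
            alpha = [sum(alpha[k] * A[k][i] for k in range(n)) * B[i][seq[t]]
                     for i in range(n)]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # the model cannot produce this sequence
        total += math.log(scale)
        alpha = [v / scale for v in alpha]
    return total / len(seq)


def accepts(seq, pi, A, B, n_samples=500, threshold=0.05, seed=0):
    rng = random.Random(seed)
    L = avg_log_likelihood(seq, pi, A, B)
    sampled = [avg_log_likelihood(sample_sequence(pi, A, B, len(seq), rng), pi, A, B)
               for _ in range(n_samples)]
    # Empirical estimate of P(log-likelihood < L) under the model.
    fraction_below = sum(1 for v in sampled if v < L) / n_samples
    return fraction_below > threshold


pi = [0.5, 0.5]
A = [[0.9, 0.1], [0.1, 0.9]]   # states tend to persist
B = [[0.9, 0.1], [0.1, 0.9]]   # state i mostly emits symbol i
print(accepts([0] * 20 + [1] * 20, pi, A, B))  # long runs, typical of this model
print(accepts([0, 1] * 20, pi, A, B))          # rapid alternation, atypical
```

The test is one-sided: a sequence whose likelihood is unusually high is still accepted, and only sequences that fall in the low tail of the empirical distribution are rejected.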



¹ In order to eliminate the influence of the sequence length we actually take the average
log-likelihood, $\log(P(O|\lambda)) / (\text{sequence length})$.

HMM-based clustering of output sequences. Due to the above assumption,
the task of clustering the sequences is equivalent to the task of finding a
set of hidden Markov models that accept disjoint subsets of the original set of
sequences. A set of sequences can be clustered by fitting an HMM to all the
sequences in the set and then applying a fixed-point operation that refines the
HMM and “shrinks” the initial set to the subset of sequences accepted by the
resulting HMM. Given a set $S$ of sequences and a model HMM, the fixed-point
operation is:

— $S_0 \leftarrow S$

— repeat

    — $S_0' \leftarrow S_0$

    — re-train the HMM with $S_0'$

    — $S_0 \leftarrow$ the sequences in $S_0'$ accepted by the HMM

— until $S_0 = S_0'$

Clustering of the set $S$ then proceeds by repeating the fixed-point operation for
the set $S \setminus S_0$ of remaining sequences and so forth, until no sequence remains
unassigned.

The fixed-point operation converges because at each iteration, $S_0$ can either
shrink or stay the same. In the extreme case, $S_0$ is reduced to one sequence only
and not to the empty set, because it is unlikely that an HMM trained exactly
with one sequence will not accept it.
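The control flow of the fixed-point operation and of the outer clustering loop can be sketched independently of any particular HMM library. In the sketch below, `train` and `accepts` are hypothetical callables supplied by the caller; the toy versions used in the example (a model that is just a mean value, accepted when a sequence's mean is close to it) stand in for HMM training and the statistical acceptance test. The guards against an empty accepted set are our own additions.

```python
def fixed_point(sequences, train, accepts):
    """Shrink the set until the retrained model accepts every remaining member."""
    current = list(sequences)
    while True:
        previous = current
        model = train(previous)
        current = [s for s in previous if accepts(model, s)]
        # Stop at the fixed point; also stop if nothing is accepted (guard).
        if current == previous or not current:
            return model, current


def cluster_all(sequences, train, accepts):
    """Repeat the fixed-point operation on whatever is still unassigned."""
    remaining = list(sequences)
    clusters = []
    while remaining:
        model, members = fixed_point(remaining, train, accepts)
        if not members:
            # Guard (not in the source): fall back to a singleton cluster.
            members = [remaining[0]]
            model = train(members)
        clusters.append((model, members))
        remaining = [s for s in remaining if s not in members]
    return clusters


# Toy stand-ins for HMM training and acceptance.
def train(seqs):
    values = [v for s in seqs for v in s]
    return sum(values) / len(values)


def accepts(model, seq):
    return abs(sum(seq) / len(seq) - model) < 4.0


data = [(0, 0, 1), (0, 1, 0), (1, 0, 0), (9, 10, 9), (10, 9, 10)]
for model, members in cluster_all(data, train, accepts):
    print(round(model, 2), members)
```

In the example the first model is trained on all five sequences and is a compromise between the two groups. It still accepts the three low-valued sequences, the retrained model accepts exactly those three, and the two remaining sequences form the second cluster.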

5.2 Clustering with DTW + HMMs 

When fitting an HMM to a set of sequences, the induction algorithm will try 
to fit all the sequences in the set equally well. Because the number of states is 
set in advance and not learned from the data, it is not clear how the states are 
“allocated” to the different sequences. It is likely that the states’ observation 
probability distributions will cover the regions in the observation space most 
often visited by the given sequences and that the state transition probabilities will
be changed accordingly. This means that if the set contains sequences generated 
by distinct models, it is likely that the induced HMM will be a “compromise” 
between the original models (the most frequent states of either generating model 
will appear in the learned model). It is not clear what this compromise model 
is. Because the training algorithm converges to a local maximum, the resulting 
HMM is highly dependent on the initial model from which the training algorithm 
starts. 

Therefore, if we assume that the sequences in a training set were generated 
by some hidden Markov models and our task is to identify these models, then 
it is advantageous to start the HMM clustering algorithm with even an appro- 
ximate initial clustering. If the majority of sequences in the initial cluster come 
from the same model, then it is likely that the learned compromise HMM will 
be closer to this one (the learned states do not have to cover the regions in 
the observation space frequently visited by the other HMM). Since the DTW
clustering technique can provide a good initial partition, the HMM clustering
algorithm is initialized with it. For each cluster in the DTW partitioning, an 
HMM is created by applying the fixed-point operation described in the previous 
section to the sequences of the cluster. The remaining sequences from each DTW 
cluster are then checked against the HMMs of the other DTW clusters. Finally, 
if any sequences are still unassigned to an HMM, they are placed in a set that 
is clustered solely by HMM clustering. 



6 Experiments 



We tested our algorithm on an artificial dataset generated as in (Smyth 1997),
from two hidden Markov models. The two hidden Markov models that generated
the artificial dataset each have two states, one that emits one symbol from
the normal density with mean 0 and variance 1, $\mathcal{N}(0, 1)$, and one that emits
one symbol from $\mathcal{N}(3, 1)$. The two models differ in their transition probability
matrices. These matrices are:

$$A_{HMM_1} = \begin{pmatrix} .6 & .4 \\ .4 & .6 \end{pmatrix} \qquad A_{HMM_2} = \begin{pmatrix} .4 & .6 \\ .6 & .4 \end{pmatrix}$$



As explained in (Smyth 1997) this is only apparently an easy problem. 

Because the output of the states is continuous and we implemented our
clustering algorithm with discrete HMMs, we discretized the output values with a
Kohonen network with 20 units (so the output alphabet has 20 symbols in our
experiments). Again as in (Smyth 1997), the training set consists of 40 sequences.
The first 20 are generated by $HMM_1$ and the last 20 by $HMM_2$. Each
sequence has length 200.
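A dataset of this shape can be generated with a few lines of Python, which also makes it easy to see why the problem is harder than it looks. The sketch below is ours and rests on assumptions: the initial state is drawn uniformly, the seed is arbitrary, the first transition matrix is assumed to be the mirror image of the second (self-transition probability 0.6 instead of 0.4), and no discretization is applied. The helper names are hypothetical.

```python
import random


def sample_hmm(A, means, length, rng):
    """Sample a sequence from a two-state HMM with unit-variance normal outputs."""
    state = rng.randrange(2)  # assumption: uniform initial state
    seq = []
    for _ in range(length):
        seq.append(rng.gauss(means[state], 1.0))
        state = 0 if rng.random() < A[state][0] else 1
    return seq


def overall_mean(seqs):
    values = [v for s in seqs for v in s]
    return sum(values) / len(values)


def same_side_rate(seqs, cut=1.5):
    """Fraction of consecutive pairs that fall on the same side of the cut."""
    same = total = 0
    for s in seqs:
        for u, v in zip(s, s[1:]):
            same += (u > cut) == (v > cut)
            total += 1
    return same / total


A1 = [[0.6, 0.4], [0.4, 0.6]]  # assumed values for the first model
A2 = [[0.4, 0.6], [0.6, 0.4]]
rng = random.Random(0)
first = [sample_hmm(A1, (0.0, 3.0), 200, rng) for _ in range(20)]
second = [sample_hmm(A2, (0.0, 3.0), 200, rng) for _ in range(20)]
dataset = first + second

print(overall_mean(first), overall_mean(second))      # both close to 1.5
print(same_side_rate(first), same_side_rate(second))  # slightly above / below 0.5
```

Both transition matrices are symmetric, so both models visit their two states equally often and the sequences share the same marginal distribution of values. The models differ only in how often consecutive outputs come from the same state, and the overlap between the two output densities blurs even that difference.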

The clusters resulting from DTW alone are shown below. For each cluster 
the indices of the time series belonging to that cluster are shown. Ideally, cluster 
1 would contain indices 0 through 19 and cluster 2 would contain indices 20 
through 39. 

- cluster 1: 1 2 3 4 5 6 9 10 11 12 13 15 17 18 19 23 24 33 35 37 

- cluster 2: 0 7 8 14 16 20 21 22 25 26 27 28 29 30 31 32 34 36 38 39 



The resulting DTW+HMM clustering is:

- cluster 1: 1 2 3 4 5 6 9 10 11 13 14 15 16 18 

- cluster 2: 20 21 22 24 26 27 29 30 31 34 36 37 38 39 

- cluster 3: 0 8 12 17 19 23 25 28 33 

- cluster 4: 7 32 35 



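The transition matrices of the four HMMs are:

$$A_{HMM_1} = \begin{pmatrix} .69 & .30 \\ .38 & .61 \end{pmatrix} \qquad A_{HMM_2} = \begin{pmatrix} .29 & .70 \\ .64 & .35 \end{pmatrix}$$

$$A_{HMM_3} = \begin{pmatrix} .55 & .44 \\ .53 & .46 \end{pmatrix} \qquad A_{HMM_4} = \begin{pmatrix} .49 & .50 \\ .46 & .53 \end{pmatrix}$$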



It can be noticed that the HMM clustering “cleans” the clusters obtained 
by DTW. For example, the sequences 33, 35, 37, that appear in the first cluster 
in the DTW partitioning, are removed by the HMM clustering, and the resulting
DTW+HMM cluster has only sequences generated by the first model. The
second cluster has only sequences generated by the second model, too. It can
also be noticed that when HMM clustering alone is applied for the sequences 
removed from the DTW clusters, the resulting clusters, 3 and 4, have mixed 
sequences. Thus, HMM clustering alone does not work well: when trained with 
mixed sequences a compromise HMM is learned, rather than an HMM close to 
one of the true models. The transition matrices for the first two models are very 
different: each HMM fits the idiosyncrasies of the sequences emitted by the true 
models. As for the last two models, each of them is a compromise between the 
two original HMMs. The above results indicate that further improvement might 
be obtained by alternating DTW and HMM clustering in repeated iterations. 

We ran experiments where we applied the inverse fixed-point operation; that 
is, the set of sequences assigned to one HMM was “grown” starting from the 
prototype returned by DTW clustering: 



— the initial HMM is created from the prototype sequence 

— iterate until the set of accepted sequences does not change: 

— test all the sequences in the DTW cluster for membership to the current 
HMM 

— retrain the HMM with the set of accepted sequences 
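The steps above can be sketched as follows. As before, `train` and `accepts` are hypothetical callables standing in for HMM training and the acceptance test, and the toy versions in the example model a cluster by a single mean value. The iteration cap `max_iter` and the returned convergence flag are our own additions, included because the accepted set is not guaranteed to stabilize.

```python
def grow_from_prototype(prototype, cluster, train, accepts, max_iter=20):
    """Grow the accepted set starting from a model of the prototype alone."""
    model = train([prototype])
    accepted = [prototype]
    for _ in range(max_iter):
        new_accepted = [s for s in cluster if accepts(model, s)]
        if new_accepted == accepted:
            return model, accepted, True    # the accepted set no longer changes
        if not new_accepted:
            return model, new_accepted, False  # the model accepts nothing
        accepted = new_accepted
        model = train(accepted)
    return model, accepted, False           # gave up: the set kept changing


# Toy stand-ins for HMM training and acceptance.
def train(seqs):
    values = [v for s in seqs for v in s]
    return sum(values) / len(values)


def accepts(model, seq):
    return abs(sum(seq) / len(seq) - model) < 1.0


cluster = [(0, 0, 1), (0, 1, 0), (1, 1, 1), (2, 2, 2), (9, 9, 9)]
model, members, converged = grow_from_prototype((0, 1, 0), cluster, train, accepts)
print(members, converged)  # [(0, 0, 1), (0, 1, 0), (1, 1, 1)] True
```

In this toy run the set grows from the prototype alone to three sequences and then stabilizes. With a real HMM, retraining can cause a previously accepted sequence to be rejected, which is the case the iteration cap is there to catch.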



Because it is not guaranteed that at each iteration the set of accepted se- 
quences grows - a previously accepted sequence may no longer be accepted after 
retraining - the process does not always converge. Another problem with star- 
ting the fixed-point operation from the prototype is that the initial HMM might 
overfit the prototype and accept no other sequence. Despite these problems, we 
intend to pursue this method of clustering because it holds promise for better 
clustering than by HMM alone. HMM clustering is hindered by the unknown
number of states: with enough states the HMM has enough branches to accommodate
different output sequences; with few states it may overgeneralize and
again accept different sequences. Provided that an adequate number of states
is found, creating the HMM from the prototype may lead to a model with one
“main” (very likely) state path. Such a model corresponds to our intuitive no- 
tion of similarity of time series: the observed output sequences are generated by 
random walks along similar state sequences. 

We also applied the technique to clustering time series of robot sensor values.
The time series were collected during several simple experiences: the robot was
approaching, pushing or passing an object. While we do not know the true
clusters in the robot data, we considered a good clustering one which reflects the
kinds of experiences enumerated above. We observed the same effect of
“cleaning” the DTW clusters by the HMM, but the set of sequences removed by the
HMM fixed-point operation was large and poorly clustered by the HMM
clustering method. Also, the initial DTW clustering of robot data is not as good as the
one for artificial data. As such, the HMM clustering algorithm does not benefit
from the initial DTW partition. In this case, both DTW clustering and HMM
clustering alone are better than the DTW+HMM combination. The results
obtained by HMM clustering alone for the robot data are listed below. For each time
series we marked the events that we considered important, for example “push
red object” or “pass red object on left”. These results illustrate the problem of
“branchy” HMMs, which accept distinct sequences on different state paths.

cluster 1, prototype 91

— 4 8 20 31 63 91 112 134: push red object

— 16 36 54 138: pass red object on right

— 107 123 127 129: pass red object on left

— 96: approach right red object

cluster 2, prototype 94

— 1 6 41 58 61 77: crash into red object

— 94 98: approach left red object

cluster 3, prototype 125

— 59 73 125: push red object

— 18: crash into red object

cluster 4, prototype 116

— 45 75 116: pass red object on right

— 32: crash into red object

— 158: push red object

cluster 5, prototype 146

— 80 146: crash into red object

We think that the problem of the unknown number of HMM states must 
be solved before trying to cluster and represent real data with HMMs. We plan 
to apply the minimum description length principle for tackling this difficult 
problem. 

7 Conclusion 

We presented a hybrid time series clustering algorithm that uses dynamic time 
warping and hidden Markov model induction. The algorithm worked well in 
experiments with artificial data. The two methods complement each other: DTW 
produces a rough initial clustering and the HMM removes from these clusters the 
sequences that do not belong to them. The downside is that the HMM removes 
some good sequences along with the bad ones. We suggested possible ways of 
improving the method and are currently working on validating them.

1 Introduction

One of the fundamental aspects of human intelligence is the ability to process 
temporal information (Lashley, 1951). Learning and reproducing temporal se- 
quences are closely associated with our ability to perceive and generate body 
movements, speech and language, music, etc. A considerable body of neural net- 
work literature is devoted to temporal pattern generation (see Wang, 2001, for 
a recent review). These models generally treat a temporal pattern as a sequence
of discrete patterns, called a temporal sequence. Most of the models are based 
on either multilayer perceptrons with backpropagation training or the Hopfield 
model of associative recall. The basic idea for the former class of models is to 
view a temporal sequence as a set of associations between consecutive com- 
ponents, and learn these associations as input-output transformations (Jordan, 
1986; Elman, 1990; Mozer, 1993). To deal with temporal dependencies beyond 
consecutive components, part of the input layer is used to keep a trace of history, 
behaving as short-term memory (STM). Similarly, for temporal recall based on 
the Hopfield associative memory, a temporal sequence is viewed as associati- 
ons between consecutive components. These associations are stored in extended 
versions of the Hopfield model that includes some time delays (Sompolinsky & 
Kanter, 1986; Buhmann & Schulten, 1987; Heskes & Gielen, 1992). To deal with 
longer temporal dependencies, high-order networks have been proposed (Guyon 
et al., 1988).



1.1 Learning Complex Sequences 

One of the main problems with the above two classes of models lies in the 
difficulty in retrieving complex temporal sequences, where the same part may 
occur many times in the sequence. Though proposed remedies can alleviate the 
problem to some degree, the problem is not completely resolved. In multilayer 
perceptrons, a blended form of STM becomes increasingly ambiguous when
temporal dependencies increase (Bengio et al., 1994). The use of high-order units
in the Hopfield model requires a huge number of connections to deal with long
range temporal dependencies, or the model yields ambiguities.

* Thanks to X. Liu for his help in typesetting. The preparation of this chapter was
supported in part by an ONR YIP award and a grant from NUWC.

More recently, Bradski et al. (1994) proposed an STM model, which exhibits 
both recency and primacy, where the former means that more recent items in 
a sequence are better retained and the latter means that the beginning items 
are less prone to forgetting. Both recency and primacy are characteristics of 
human STM. In addition, their model creates new representations for repeated
occurrences of the same symbol, and is thus capable of encoding complex sequences to
a certain extent. Granger et al. (1994) proposed a biologically motivated model
for encoding temporal sequences. Their model uses a competitive learning
rule that eventually develops sequence detectors at the end of sequence pre- 
sentation. Each detector encodes a sequence whereby the beginning component 
has the highest weight, and the subsequent components have successively lower 
weights. They claim that the network has an unusually high capacity. However, 
it is unclear how their network reads out the encoded sequences. Baram (1994) 
presented a model for memorizing vector sequences using the Kanerva memory 
model (Kanerva, 1988). The basic idea is similar to those models that are based 
on the Hopfield model. Baram’s model uses second-order synapses to store the 
temporal associations between consecutive vectors in a sequence, but the mo- 
del deals only with sequences that contain no repeating vectors. Rinkus (1995) 
proposed a model of temporal associative memory, based on associations among 
random sequences. The associations are encoded using a method similar to the 
associative memory of Willshaw et al. (1969). However, the model needs an
additional operation that maps a sequence component to a semi-random vector
and remembers the mapping for later decoding.

Based on the idea of using STM for resolving ambiguities, Wang and Arbib 
(1990) proposed a model for learning to recognize and generate complex
sequences. With an STM model, a complex sequence is acquired by a learning rule
that associates the activity distribution in STM with a context detector (for a
rigorous definition see Section 2). For sequence generation, each component of a
sequence is associated with a context detector that learns to associate with the 
component. After successful training, a beginning part of the sequence forms an 
adequate context for activating the next component, and the newly activated 
component joins STM to form a context for activating the following component. 
This process continues until the entire sequence is generated. A later version 
(Wang & Arbib, 1993) deals with the issues of time warping and chunking of 
subsequences. In particular, sequences in this version can be recognized in a 
hierarchical way and without being affected by presentation speed. Hierarchical 
recognition enables the system to recognize sequences whose temporal depen- 
dencies are much longer than the STM capacity. In sequence generation, the 
system is capable of maintaining relative timing among the components while 
the overall rate can change. 

Recently, L. Wang (1999) proposed to use multi-associative neural networks 
for learning and retrieving spatiotemporal patterns. STM is coded by
systematic delay lines. The basic idea is that, when dealing with complex sequences,
one pattern is allowed to be associated with a set of subsequent patterns, and
ambiguity can be eliminated by intersecting multiple sets associated with the
previous pattern, the pattern prior to the previous pattern, and so on. It is easy
to see that a complex sequence can be unambiguously generated with a
sufficient number of systematic delay lines. Associations between spatial patterns are
established through single units in a competitive layer. A major drawback of the
multi-associative network model is that much of the system architecture and many
network operations are algorithmically described, rather than arising from an
autonomous neural network (e.g., no discussion on how set intersection is neurally
implemented).



1.2 Sequential Learning Problem 

A comprehensive model of temporal sequence learning must address the issue of 
sequentially learning multiple sequences; that is, how are new sequences learned 
after some sequences have been acquired? One way of learning multiple sequences 
is to use simultaneous training, where many sequences are learned at once. A
model that can learn one sequence can generally be extended to learn multiple 
sequences with simultaneous training. A straightforward way is to concatenate 
multiple sequences into a single long sequence. Given that each sequence has 
a unique identifier, a model can learn all of the sequences if it can learn the 
concatenated sequence. However, sequential learning of multiple sequences is an 
entirely different matter. It is a more desirable form of training because it allows 
the model to acquire new sequences without bringing back all the previously 
used sequences - a form of incremental learning. Incremental learning not only
conforms well with human learning, but also is important for many applications 
that do not keep all the training data and where learning is a long-term on-going 
process. 

It turns out that incremental learning is a particularly challenging problem 
for neural networks. In multilayer perceptrons, it is well recognized that the
network exhibits so-called catastrophic interference, whereby later training disrupts
the traces of previous training. It was pointed out by Grossberg (1987), and
systematically revealed by McCloskey and Cohen (1989) and Ratcliff (1990). Many
subsequent studies attempt to address the problem, and most of the proposed
remedies amount to reducing overlapping in hidden layer representations by some
form of orthogonalization, a technique used long ago for reducing cross-talk
in associative memories (see Kruschke, 1992, Sloman & Rumelhart, 1992, and
French, 1994). Most of these proposals are verified only by small-scale
simulations, which, together with the lack of rigorous analysis, makes it difficult to judge
to what extent the proposed methods work. It remains to be seen whether a
general remedy can be found for multilayer perceptrons. Associative memories
are less susceptible to the problem, and appear to be able to incorporate more 
patterns easily so long as the overall number of patterns does not exceed the 
memory capacity. However, the Hopfield model has a major difficulty in dealing 
with correlated patterns with overlapping components (Hertz et ah, 1991). The 
Hopfield model has been extended to deal with correlated patterns (Kantor & 
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Sompolinsky, 1987), and Diederich and Opper (1987) proposed a local learning 
rule to acquire the necessary weights iteratively. The local learning rule used by 
them is very similar to the perceptron learning rule. Thus, it appears that such 
a scheme for dealing with correlated patterns would suffer from catastrophic 
interference. 

The major cause of catastrophic interference is the distributedness of repre- 
sentations; the learning of new patterns needs to use and alter those weights 
that participate in representing previously learned patterns. There is a tradeoff 
between distributedness and interference. Models that use non-overlapping re- 
presentations, or local representations, do not exhibit the problem. For example, 
the ART model (Carpenter & Grossberg, 1987) does not have the problem be- 
cause each stored pattern uses a different weight vector and no overlapping is 
allowed between any two weight vectors. 

During sequential learning, humans show some degree of interference. 
Retroactive interference, which occurs when learning a later event interferes 
with the recall of earlier information, has been well documented in psychology 
(Crooks & Stein, 1991). In general, the similarity between the current event 
and memorized ones is largely responsible for retroactive interference (Barnes 
& Underwood, 1959; Chandler, 1993; Bower et al., 1994). Animals also exhibit 
retroactive interference (Rodriguez et al., 1993). The existence of retroactive 
interference suggests that events are not independently stored in the brain, 
and related events are somehow intertwined in memory. Although recall 
performance on the interfered items decreases, it is still better than chance, 
and it is easier to relearn these items than to learn them for the first time. 
This analysis suggests that a memory model that stores every item independently 
cannot adequately model human/animal memory. From the computational 
perspective, models that store different events in a shared way have better 
storage efficiency than those that do not. In summary, a desired memory model 
should exhibit some degree of retroactive interference when learning similar 
events, but not catastrophic interference. 



In this chapter, we describe the anticipation model for temporal sequence 
learning (Wang & Yuwono, 1995; Wang & Yuwono, 1996). Similar to Wang and 
Arbib (1990), an STM model is used for maintaining a temporal context. In lear- 
ning a temporal sequence, the model actively anticipates the next component 
based on STM. When the anticipation is correct, the model does nothing and 
continues to learn the rest of the sequence. When the anticipation is incorrect, 
namely a mismatch occurs, the model automatically expands the context for the 
component. A one-shot (single step) normalized Hebbian learning rule is used 
to learn contexts, and it exhibits the mechanism of temporal masking, where a 
sequence masks its subsequences in winner-take-all competition. The anticipa- 
tion model can learn to generate an arbitrary sequence by self-organization, thus 
avoiding supervised teaching signals as required in Wang and Arbib. Further- 
more, the model is examined in terms of its performance on sequential training 
tasks. We show that the anticipation model is capable of incremental learning, 
and exhibits retroactive interference but not catastrophic interference. Extensive 
simulations reveal that the amount of retraining is relatively independent of the 
number of sequences stored in the model. Furthermore, a mechanism of chun- 
king is described that creates chunks for recurring subsequences. This chunking 
mechanism significantly improves training and retraining performance. 

The remaining part of the chapter is organized as follows. In Section 2, the 
anticipation model is fully defined. Section 3 introduces several rigorous results 
of the anticipation model. In Section 4, we provide simulation results of the 
model, in particular for learning many sequences incrementally. The simulation 
results suggest that incremental learning in the anticipation model is capacity- 
independent, or unaffected by the number of stored sequences. Section 5 de- 
scribes the chunking mechanism and shows how chunking improves the learning 
performance. Section 6 provides some general discussions about the anticipation 
model. Finally, Section 7 concludes the chapter. 

2 Anticipation Model 

We follow the terminology introduced by Wang and Arbib (1990). Sequences are 
defined over a symbol set $\Gamma$, which consists of all possible symbols, or 
spatial (static) patterns. A sequence $S$ of length $N$ over $\Gamma$ is defined 
as $p_1$-$p_2$-...-$p_N$, where $p_i \in \Gamma$ ($1 \le i \le N$) is called a 
component of $S$. The sequence $p_j$-$p_{j+1}$-...-$p_k$, where 
$1 \le j \le k \le N$, is a subsequence of $S$, and the sequence 
$p_j$-$p_{j+1}$-...-$p_N$, where $1 \le j \le N$, is a right subsequence of $S$. 
In general, in order to produce a component from its predecessors in a sequence, 
a prior subsequence is needed. For example, producing the first “P” in the 
sequence M-I-S-S-I-S-S-I-P-P-I requires the prior subsequence S-I-S-S-I. This is 
because I-S-S-I is a recurring subsequence. Hence, the context of $p_i$ is 
defined as the shortest prior subsequence of $p_i$ that uniquely determines 
$p_i$ in $S$. The degree of $p_i$ is the length of its context. The degree of 
$S$ is the maximum degree of all of the components of $S$. Therefore, a simple 
sequence, where each component is unique, is a degree 1 sequence, and a complex 
sequence, which contains recurring subsequences, is a sequence whose degree is 
greater than 1. 
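
The definitions of context and degree can be made concrete with a short sketch. 
The Python below is illustrative only: the function names are ours, a sequence 
is written as a string of single-character symbols, and an occurrence of a 
prior subsequence at the very end of the sequence (where it has no successor) 
is ignored, which is an assumption the definitions above do not spell out. 

```python
def followers(seq, ctx):
    """Set of symbols that follow any occurrence of ctx inside seq."""
    d = len(ctx)
    return {seq[s + d] for s in range(len(seq) - d) if seq[s:s + d] == ctx}


def component_degree(seq, i):
    """Length of the shortest prior subsequence that uniquely determines seq[i].

    Returns None if no prior subsequence disambiguates the component
    (an edge case the sketch does not try to resolve).
    """
    for d in range(1, i + 1):
        if followers(seq, seq[i - d:i]) == {seq[i]}:
            return d
    return None


def sequence_degree(seq):
    """Maximum degree over all components that have a predecessor."""
    degrees = [component_degree(seq, i) for i in range(1, len(seq))]
    return max(d for d in degrees if d is not None)
```

For M-I-S-S-I-S-S-I-P-P-I, the first P sits at index 8 and 
`component_degree("MISSISSIPPI", 8)` gives 5, the length of S-I-S-S-I; the 
sequence as a whole also has degree 5, whereas A-B-C-D-E has degree 1. 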



2.1 Basic Network Description 

We now describe the basic components of the anticipation model. Fig. 1 shows 
the architecture of the network. The network consists of a layer of n input termi- 
nals, each associated with a shift-register (SR) assembly, and a layer of m context 
detectors, each associated with a modulator. Each SR assembly contains r units, 
arranged so that the input signal stimulating an input terminal shifts to the next 
unit every time step. SR assemblies serve as STM for input signals. Each 
detector receives input from all SR units, and there are lateral connections, 
including self-excitation, within the detector layer that form a 
winner-take-all architecture. Through these connections and competitive 
dynamics, the detector that receives the greatest ascending input from SR units 
becomes the sole winner of the entire detector layer. In addition to the 
ascending and lateral connections, each detector also connects mutually with 
its corresponding modulator, which in turn connects directly with input 
terminals. 




[Figure 1: network diagram; recoverable labels: shift-register assembly, modulator] 

Fig. 1. Architecture of the anticipation model. Thin solid lines denote modifiable 
connections, and thick or dashed lines denote fixed connections. The connections 
between terminals and modulators are bidirectional. 



The model internally anticipates the next component and compares it with 
the external input through the modulator layer. Each modulator unit receives 
upward connections from every individual terminal. In addition, it receives a 
downward connection from its respective detector. An active detector enables 
its corresponding modulator in the next time step (assuming some delay). Once 
enabled, the modulator performs one-shot learning that updates its connection 
weights from the terminals. Since only one terminal corresponding to an input 
component can be active at any time step, one-shot learning leads to one-to- 
one connection from an active terminal to an enabled modulator. Basically, this 
one-shot learning establishes the association between a context detector and the 
next input component. If the active terminal and the enabled modulator do not 
match next time when the detector is activated, the anticipated activation of 
the modulator will be absent. This mismatch will be detected by the modulator, 
which in turn will send a signal to its respective detector to expand the context 
that the detector is supposed to recognize. 

2.2 Model Description 

The activity of detector $i$ at time $t$, $E_i(t)$, is defined as: 

$$E_i(t) = g\Big(\sum_{j,k} W_{i,jk}\, g\big(V_{jk}(t), A_i\big),\ \theta_i\Big) \qquad (1)$$

$$g(x, y) = \begin{cases} x & \text{if } x \ge y \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$

where $W_{i,jk}$ is the connection weight from the $k$th SR unit of assembly $j$ 
to detector $i$, and $V_{jk}(t)$ is the activity of this SR unit at time $t$. 
$\theta_i$ is an adjustable threshold for the detector, which is initialized to 
0. $\theta_i$ may be increased when detector $i$ wins the winner-take-all 
competition in the detector layer, to be discussed later. $A_i$ is defined 
later in (5). The activity $V_{jk}(t)$ is given as follows. 



$$V_{jk}(t) = \begin{cases} I_j(t) & \text{if } k = 1 \text{ (head unit)} \\ \max\big(0,\ V_{j,k-1}(t-1) - \delta\big) & \text{otherwise} \end{cases} \qquad (3)$$

where $I_j(t)$ is the binary activity of terminal $j$, that is, $I_j(t) = 1$ if 
the corresponding symbol of terminal $j$ is being presented to the network at 
time $t$, and $I_j(t) = 0$ otherwise. Due to the nature of sequential input, at 
most one terminal has its $I$ equal to 1 at time $t$. $\delta$ is a decay 
parameter. Eq. 3 provides an implementation of the STM model described earlier, 
i.e., an input activity is held for a short time but decays gradually in a 
shift-register assembly. If assembly $j$ is stimulated by an input at time $t$, 
namely $I_j(t) = 1$, then according to (3) the end unit of the assembly is 
activated at time $t + r - 1$, and its activity is 
$V_{jr}(t + r - 1) = \max(0, 1 - \delta(r - 1))$. Clearly, the input cannot be 
held longer than $r$ steps, the limit of STM capacity. Given $r$, in order for 
the input to be held for $r$ steps, the parameter $\delta$ must be chosen so 
that $1 - \delta(r - 1) > 0$, that is, $\delta < 1/(r - 1)$. 
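
The shift-and-decay behavior of Eq. 3 can be sketched for a single assembly as 
follows. This is a minimal illustration under our own naming (`step_assembly`, 
`trace`), with the assembly held as a plain list whose first entry is the head 
unit; it is not taken from the original implementation. 

```python
def step_assembly(v, input_bit, delta):
    """Advance one shift-register assembly by one time step (Eq. 3).

    v[0] is the head unit and takes the terminal activity directly; every
    other unit takes its predecessor's previous activity minus delta,
    clipped at 0.
    """
    return [float(input_bit)] + [max(0.0, x - delta) for x in v[:-1]]


def trace(r, delta, steps):
    """Stimulate an assembly of r units once at t = 0; return all states."""
    v = [0.0] * r
    states = []
    for t in range(steps):
        v = step_assembly(v, 1 if t == 0 else 0, delta)
        states.append(v)
    return states
```

With r = 6 and a decay of 0.1, the end unit becomes active at step r - 1 with 
activity 1 - 0.1(r - 1) = 0.5 (up to floating-point rounding), and one step 
later the assembly is empty again, which is the STM limit described above. 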

As mentioned earlier, all the detector units in the detector layer form a 
winner-take-all network. The detailed dynamics of winner-take-all can be found 
in Grossberg (1976). In such a competitive network, the activity of each detector 
evolves until the network reaches equilibrium, at which point the detector with 
the highest initial activity becomes the only active unit. The network takes a 
short time to reach equilibrium. This time period should be much shorter than 
the duration of one sequence component. Therefore, we assume that each discrete 
time step is longer than the time needed for the winner-take-all mechanism to 
settle at an equilibrium. 

The learning rule for each detector $i$ is a Hebbian rule (Hebb, 1949) plus 
normalization to keep the overall weight constant (von der Malsburg, 1973; 
Wang & Arbib, 1990), and it is referred to as a normalized Hebbian rule, 

$$\hat{W}_{i,jk}(t+1) = W_{i,jk}(t) + \alpha\, O_i(t)\, g\big(V_{jk}(t), A_i\big) \qquad (4a)$$

$$W_{i,jk}(t+1) = \frac{\hat{W}_{i,jk}(t+1)}{\alpha C + \sum_{j,k} \hat{W}_{i,jk}(t+1)} \qquad (4b)$$

where $\alpha$ is a gain parameter or learning rate. A large $\alpha$ makes 
training fast. It is easy to see that a very large $\alpha$ leads to 
approximate one-shot learning. As mentioned earlier, winner-take-all 
competition in the detector layer will activate a single unit from the layer. 
To indicate the outcome while omitting the details of competitive dynamics, let 
$O_i(t)$ equal 1 if detector $i$ is the winner of the competition, or 0 
otherwise. Function $g$, as defined in (2), serves as a gate to let in the 
influences of only those SR units whose activities are greater than or equal to 
$A_i$. $A_i$ is the sensitivity parameter of unit $i$. The lower the 
sensitivity parameter, the more SR units can be sensed by a winning detector, 
and thus the more connections of the detector can be modified according to 
(4a). Furthermore, the sensitivity parameter $A_i$ is itself adaptive: 

$$A_i = \begin{cases} 1 & \text{if } d_i = 0 \\ \max\big(0,\ 1 - \delta(d_i - 1)\big) & \text{if } d_i > 0 \end{cases} \qquad (5)$$

where $d_i$, indicating the degree of detector $i$, is a non-negative integer, 
initialized to 0. $\delta$ is the decay parameter introduced in (3). According 
to (5), $A_i$ is equal to 1 when $d_i = 0$ or 1, and decreases toward 0 as 
$d_i$ increases. Since value 1 is the activity level of the corresponding head 
unit when some assembly is stimulated, detector $i$ will only sense one SR 
unit, a head unit, when $d_i = 0$ or 1. When $d_i$ increases, more SR units are 
sensed. Except when $d_i = 0$, $d_i$ is equal to the number of units that 
detector $i$ can sense when it becomes a winner. The constant $C$ in (4b) is 
positive, and its role will be described in Sect. 3. The connection weight 
$W_{i,jk}$ is initialized to $1/[r(1 + C) + \epsilon]$, where $\epsilon$ is a 
small random number introduced to break symmetry between the inputs of the 
detectors, which may otherwise cause problems for competitive dynamics. 
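
To make the interplay of the gate, the sensitivity parameter, and the 
normalized Hebbian rule easier to follow, the sketch below applies (4a) and 
(4b) to a single winning detector. It is a simplified reading of the equations 
rather than the original code: weights and SR activities are stored in 
dictionaries keyed by (assembly, unit), the random term in the initial weights 
is omitted, and the small tolerance in `gate` is our addition to guard against 
floating-point rounding. 

```python
def gate(x, y, tol=1e-9):
    """g(x, y) of Eq. 2: pass x through only if it reaches y."""
    return x if x >= y - tol else 0.0


def sensitivity(d, delta):
    """A_i of Eq. 5 for a detector of degree d."""
    return 1.0 if d == 0 else max(0.0, 1.0 - delta * (d - 1))


def hebbian_update(W, V, A, alpha, C):
    """One application of (4a) and (4b) for a winning detector (O_i = 1).

    W maps (assembly j, unit k) to a weight; V maps the same keys to the
    current SR activities.
    """
    W_hat = {jk: W[jk] + alpha * gate(V[jk], A) for jk in W}   # (4a)
    norm = alpha * C + sum(W_hat.values())                      # (4b)
    return {jk: w / norm for jk, w in W_hat.items()}


def activity(W, V, A):
    """Weighted, gated input to the detector (the sum inside Eq. 1)."""
    return sum(W[jk] * gate(V[jk], A) for jk in W)
```

In a small trial with two assemblies of three units, C = 3, a learning rate of 
0.2, and only the head unit of one assembly active, repeated calls to 
`hebbian_update` raise the detector's activity for that input at every step, 
while the weights of the unsensed units shrink. 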

Let unit $z$ be the winner of the competition in the detector layer. As a 
result of the updated connection weights, the activity of unit $z$ will change 
when the same input is presented in the future. More specifically, $E_z$ is 
monotonically non-decreasing as learning takes place. This observation will be 
further discussed in the next section. The resulting, increased activity in (1) 
is then used to update the threshold of unit $z$. This is generally described 
as: 

$$\theta_i(t+1) = \theta_i(t) + O_i(t)\big[E_i^*(t+1) - \theta_i(t)\big] \qquad (6)$$

where $E_i^*(t+1)$ is the activity of unit $i$ based on the new weights, i.e., 
$E_i^*(t+1) = \sum_{j,k} W_{i,jk}(t+1)\, g(V_{jk}(t), A_i)$. Thus, $\theta_i$ 
is adjusted to $E_i^*(t+1)$ if unit $i$ is the winner. Otherwise, $\theta_i$ 
remains the same. Due to this adjustment, unit $z$ increases its threshold so 
that it will be triggered only by the same subsequence whose components have 
been sensed during weight updates by (4a). The above threshold can be relaxed 
(lowered) a little when handling sequences with certain distortions. This way, 
subsequences very close to the training one can also activate the detector. 

A modulator receives both a top-down connection from its corresponding 
detector and bottom-up connections from input terminals (Fig. 1). We assume 
that the top-down connection modulates the bottom-up connections by a 
multiplicative operation. Thus, the activity of modulator $i$ is defined as 

$$M_i(t) = O_i(t-1) \sum_{j=1}^{n} R_{ij}\, I_j(t) \qquad (7)$$

where $R_{ij}$ is a binary weight of the connection from terminal $j$ to 
modulator $i$. All $R_{ij}$'s are initialized to 0. We assume that the top-down 
signal takes one step to reach its modulator. Because at most one terminal is 
active ($I_j(t) = 1$) at any time, $M_i(t)$ is also a binary value. If 
$O_i(t-1) = 1$ and $M_i(t) = 0$, then the modulator sends a feedback signal to 
its corresponding detector. Upon receiving this feedback signal, the detector 
increases its degree, thus lowering its sensitivity parameter $A_i$ (see 
Eq. 5). Quantitatively, $d_i$ is adjusted as follows: 

$$d_i \leftarrow \begin{cases} d_i + O_i(t-1) & \text{if } M_i(t) = 0 \\ d_i & \text{otherwise} \end{cases} \qquad (8)$$

The situation where $O_i(t-1) = 1$ and $M_i(t) = 0$ is referred to as a 
mismatch. A mismatch occurs when an anticipated component in the sequence does 
not appear, as explained shortly. Thus the degree of a context detector 
increases when a mismatch occurs. 

Finally, one-shot learning is performed on the bottom-up connection weights 
of the modulator of the winning detector $z$, 

$$R_{zj} = I_j(t) \qquad (9)$$

This one-shot learning sets the connection weights of modulator $z$ to the 
current activities of the input terminals. Since there is only one active 
terminal at time $t$, i.e., the one representing the current input symbol, only 
one bottom-up weight of the modulator is equal to one, and all the others are 
zero. This training results in a one-to-one association between a modulator and 
a terminal. 

We now explain under what condition a mismatch occurs, which leads to an 
increment of the degree of the winning detector. According to (7), a mismatch 
occurs when $O_i(t-1) = 1$ and $\sum_j R_{ij} I_j(t) = 0$. Since at any time 
only one bottom-up weight of modulator $i$ equals 1 and only one input terminal 
($I_j$) is active, a mismatch occurs when the nonzero weight and the terminal 
with nonzero input do not coincide. But the one-shot learning of Eq. 9 
establishes a nonzero link only between a modulator and the next input 
terminal. Therefore, a mismatch occurs if the link between detector $i$ and an 
input terminal, established the last time detector $i$ was activated, does not 
coincide with the active input terminal this time (at time $t$). The bottom-up 
links established between a modulator (or the corresponding detector) and the 
next input component are used by the modulator to anticipate the next component 
in sequence generation. Thus a mismatch corresponds to the case where the 
anticipated input symbol does not match the actual input during sequence 
training. Since the $R_{ij}$'s are all initialized to 0, following (7) a 
mismatch is bound to occur the first time a pair of consecutive components is 
presented, which then increases the degree of the detector for the first 
component from 0 to 1. If the sequence to be learned is a simple sequence, like 
A-B-C-D-E, it suffices to increase the degrees of all involved detectors to 1. 
For complex sequences, though, the degrees of the relevant detectors need to 
increase further until no mismatch occurs. 
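
The mismatch test, the degree increment, and the one-shot update of the 
modulator weights can be read together as one small step, sketched below for 
the detector that won at the previous time step. The function name and the 
list-based representation are ours, and the step is viewed in isolation from 
the rest of the network. 

```python
def modulator_step(R_z, d_z, I_t):
    """Resolve the anticipation of detector z, which won at time t - 1.

    R_z : binary weights from the n terminals to modulator z
    d_z : current degree of detector z
    I_t : binary terminal activities at time t (at most one entry is 1)
    Returns the updated weights, the updated degree, and a mismatch flag.
    """
    M = sum(w * i for w, i in zip(R_z, I_t))   # Eq. 7 with O_z(t - 1) = 1
    mismatch = (M == 0)
    if mismatch:
        d_z += 1                                # Eq. 8: expand the context
    return list(I_t), d_z, mismatch             # Eq. 9: one-shot learning
```

Starting from all-zero weights, the first presentation necessarily produces a 
mismatch and raises the degree from 0 to 1; a repeated presentation of the same 
successor then matches, and a different successor triggers a further increment. 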

The training is repeated each time step. After all sequence components have 
been presented, the entire cycle of training, referred to as a training sweep, is 
repeated. The training phase is completed when there is no mismatch during 
the last training sweep. In this case the network correctly anticipates the next 
component for the entire sequence. The completion of the learning phase can be 
detected in various ways. For example, a global unit can be introduced to sum 
up all feedback from modulators to their respective context detectors during a 
training sweep. In this case, an inactive global unit by the end of a sweep signals 
the end of the training phase. 



3 Analytical Results 



In this section, we summarize several analytical results on the anticipation model. 
These results are listed in the form of propositions without proofs, and the 
interested reader is referred to Wang and Yuwono (1995; 1996) for detailed proofs 
of these results. 



Proposition 1. The normalized Hebbian rule of (4) with the following choice 
of parameter $C$ 

$$C > [\ldots] \qquad (10)$$

leads to a property called temporal masking: the detector of sequence $S$ is 
preferred to the detectors of the right subsequences of $S$. In other words, 
when sequence $S$ occurs, the detector that recognizes $S$ masks those 
detectors that recognize the right subsequences of $S$. This property is called 
temporal masking, following the term masking fields introduced by Cohen and 
Grossberg (1987); in masking fields, larger spatial patterns are preferred to 
smaller ones when activating their corresponding detectors. 



Inequality (10) tells us how to choose $C$ based on the value of $\delta$ in 
order to ensure that the detector of a sequence masks the detectors of its 
right subsequences. The smaller $\delta$ is, the smaller the right-hand side of 
(10) is, and thus the smaller $C$ can be chosen to satisfy the inequality. As a 
degenerate case, if $\delta = 0$, (10) becomes $C > 0$, and this corresponds 
exactly to the condition for forming masking fields in static pattern 
recognition (Cohen & Grossberg, 1987). Therefore, (10) includes masking fields 
as a special case. In temporal processing, $\delta$ reflects forgetting in STM, 
and thus cannot be 0. On the other hand, $\delta$ should be smaller than 
$1/(r - 1)$ in order to fully utilize the SR units for STM (see the discussion 
in Sect. 2.2), and thus the full degree of the learnable sequences (see Eq. 3). 
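
How $C$ trades off against $\delta$ can be explored numerically under a 
simplifying assumption of our own: the one-shot limit of a very large learning 
rate. In that limit, (4b) leaves a detector of degree $d$ with weights 
proportional to the sensed activities $a_k = 1 - k\delta$ ($k = 0, ..., d-1$), 
divided by $C + \sum_k a_k$, so its activity is 
$\sum_k a_k^2 / (C + \sum_k a_k)$. Requiring a degree $d+1$ detector to exceed 
a degree $d$ detector then reduces to 
$C > \sum_{k<d} a_k^2 / a_d - \sum_{k<d} a_k$. The sketch below evaluates the 
largest such bound over $d = 1, ..., r-1$. It is a plausibility check derived 
under these assumptions and is not presented as the form of inequality (10). 

```python
def masking_bound(r, delta):
    """Largest lower bound on C, in the one-shot limit, over d = 1..r-1.

    For each d, a detector sensing d + 1 SR units must receive more input
    than one sensing only the most recent d units.
    """
    if not 0 <= delta < 1.0 / (r - 1):
        raise ValueError("need 0 <= delta < 1/(r - 1)")
    bound = 0.0
    for d in range(1, r):
        a = [1.0 - k * delta for k in range(d + 1)]   # activities a_0 .. a_d
        s1 = sum(a[:d])
        s2 = sum(x * x for x in a[:d])
        bound = max(bound, s2 / a[d] - s1)
    return bound
```

With no decay the bound is 0, in line with the degenerate case $C > 0$ noted 
above. For r = 6 with a decay of 0.1 it is about 2.6, and for r = 40 with a 
decay of 1/40 it is about 533; the values of $C$ reported for the simulations 
in Sect. 4 (3.0 and 535, respectively) lie just above these figures. 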



Proposition 2. The following two conclusions result from the learning 
algorithm: (a) At any time, a detector can be triggered by only a single 
sequence; (b) Except for initial training, once a unit is activated by sequence 
$S$, it can only be activated by $S$ or a sequence that has $S$ as a right 
subsequence. Because of (a), one can say that a detector is tuned to the unique 
sequence that can trigger it. 

Proposition 3. An anticipation model with $m$ detectors and $r$ SR units for 
each of $n$ SR assemblies can learn to generate an arbitrary sequence $S$ of 
length $\le m$ and degree $\le r$, where $S$ is composed of symbols from 
$\Gamma$ with $|\Gamma| \le n$. 

Once training is completed, the network can be used to generate the se- 
quence it has been trained on. During sequence generation, the learned connec- 
tions from input terminals to modulators are used reversely for producing input 
components. Sequence generation is triggered by the presentation of the first 
component, or a sequence identifier. This presentation will be able to trigger an 
appropriate detector which then, through its modulator, leads to the activation 
of the second component. In turn, the newly activated terminal adds to STM, 
which then forms an appropriate context to generate another component in the 
sequence. This process continues until the entire sequence is generated. 

Aside from Proposition 3, learning is efficient - it generally takes just a few 
training sweeps to acquire a sequence. This is because the anticipation model 
employs the strategy of least commitment. The model views, as a default, the 
sequence to be learned as a simple one, and expands the contexts of sequence 
components only when necessary. Another feature of the model is that, depen- 
ding on the nature of the sequence, the system can yield significant sharing 
among context detectors: the same detector may be used for anticipating the 
same symbol that occurs many times in a sequence. As a result, the system 
needs fewer detectors to learn complex sequences than the model of Wang and 
Arbib (1990; 1993). 

The above results are about learning a single sequence, whereby a sequence 
is presented to the network one component at a time during training. When 
dealing with multiple sequences, each sequence is assumed to be unique, because 
learning a sequence that has already been acquired corresponds to recalling the 
sequence. We assume that the first component of a sequence represents the 
unique identifier of the sequence. To facilitate the following exposition, we 
define a sequential learning procedure as follows. The training process 
proceeds in rounds. In the first round, the first sequence is presented to the 
network in repeated sweeps until the network has learned the sequence. The 
second round starts with the presentation of the second sequence. Once the 
second sequence is acquired by the network, the network is checked to see if it 
can generate the first sequence correctly when presented with the identifier of 
the sequence. If the network can generate the first sequence, the second round 
ends. Otherwise, the first sequence is brought back for retraining. In this 
case, the first sequence is said to be interfered by the acquisition of the 
second sequence. If the first sequence needs to be retrained, the second 
sequence needs to be checked again after the retraining of the first sequence 
is completed, since the retraining may in turn interfere with the second 
sequence. The second round completes when both sequences can be produced by the 
network. In the third round, the third sequence is presented to the network 
repeatedly until it has been learned. The network is then checked to see if it 
can generate the first two sequences; if yes, the third round is completed; if 
not, retraining is conducted. In the latter case, retraining is always 
conducted on the sequences that are interfered. The system sequentially checks 
and retrains each sequence until every one of the three sequences can be 
generated by the network, which ends the third round. Later sequences are 
sequentially trained in the same manner. It is possible that a sequence that is 
not interfered when acquiring the latest sequence becomes interfered as a 
result of the retraining of some other interfered sequences. Because of this, 
retraining is conducted in a systematic fashion as follows. All of the previous 
sequences plus the current one are checked sequentially and retrained if 
interfered. This retraining process is conducted repeatedly until no more 
interference occurs for any sequence learned so far. Each such process is 
called a retraining cycle. Thus a round in general consists of repeated 
retraining cycles. 
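
The procedure of rounds and retraining cycles can be written down compactly. 
In the sketch below, `sequential_learning` follows the description above, while 
`ToyLearner` is only a stand-in so that the procedure can be run: it is a 
lookup table from context tuples to anticipated symbols that always answers 
with the longest stored context, and it is not the anticipation network. All 
names are ours. 

```python
class ToyLearner:
    """Stand-in learner (not the anticipation model): longest-context table."""

    def __init__(self):
        self.table = {}   # context tuple -> anticipated next symbol

    def _match(self, history):
        """Longest stored context that is a suffix of history, or None."""
        for d in range(len(history), 0, -1):
            ctx = tuple(history[-d:])
            if ctx in self.table:
                return ctx
        return None

    def sweep(self, seq):
        """Present seq once; return the number of mismatches."""
        mismatches = 0
        for i in range(1, len(seq)):
            history = seq[:i]
            ctx = self._match(history)
            if ctx is not None and self.table[ctx] == seq[i]:
                continue
            mismatches += 1
            if ctx is not None:
                self.table[ctx] = seq[i]          # overwrite: may interfere
            d = (len(ctx) if ctx else 0) + 1      # expand the context by one
            if d <= len(history):
                self.table[tuple(history[-d:])] = seq[i]
        return mismatches

    def train(self, seq, max_sweeps=100):
        """Repeat sweeps until one passes without mismatch."""
        for sweep in range(1, max_sweeps + 1):
            if self.sweep(seq) == 0:
                return sweep
        raise RuntimeError("sequence not learned")

    def generates(self, seq):
        """Try to regenerate seq from its first component (its identifier)."""
        out = [seq[0]]
        while len(out) < len(seq):
            ctx = self._match(out)
            if ctx is None:
                return False
            out.append(self.table[ctx])
        return out == list(seq)


def sequential_learning(learner, sequences, max_cycles=100):
    """Train sequences one per round; return (round, sequence) retrain events."""
    retrained = []
    for rnd in range(1, len(sequences) + 1):
        learner.train(sequences[rnd - 1])
        for _ in range(max_cycles):               # retraining cycles
            interfered = False
            for seq in sequences[:rnd]:
                if not learner.generates(seq):
                    learner.train(seq)
                    retrained.append((rnd, seq))
                    interfered = True
            if not interfered:
                break
        else:
            raise RuntimeError("retraining did not settle")
    return retrained
```

Run on the two sequences C-A-T and E-A-R, the toy learner loses C-A-T while 
acquiring E-A-R, retrains it once in the second round, and afterwards generates 
both sequences. A learner with catastrophic interference would instead exhaust 
`max_cycles` and raise the error. 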

If a system exhibits catastrophic interference, it cannot successfully complete 
a sequential learning procedure with multiple sequences. The system instead will 
show endless oscillations between learning and relearning different sequences. In 
the case of two sequences, for example, the system can only acquire one sequence 
- the latest one used in a sequential training procedure. Thus, the system will 
be stuck in the second round. 

Proposition 4. Given sufficient numbers of detectors and SR units for each 
shift-register assembly, the anticipation model can learn to produce a finite num- 
ber of sequences sequentially. 

In Proposition 4, the number of units in each SR assembly, or the STM 
capacity, must be sufficient to handle long temporal dependencies in the context 
of multiple sequences. It should be clear that the complexity of a sequence may 
increase when it is trained with other sequences. For example, X-A-B-C and Y- 
A-B-D are both simple sequences when taken separately. But when the system 
needs to memorize both sequences, A-B becomes a repeating sequence, and 
as a result both become complex sequences. We define the degree of a set 
of sequences as the maximum length of all the shortest prior sequences that 
uniquely determine all the components of all the sequences in the set. Because 
the first component of each sequence is its unique identifier, the definition of the 
set degree does not depend on how this set of sequences is ordered. Moreover, 
the degree of a set of sequences must be smaller than the length of the longest 
sequence in the set. With this definition, it is sufficient to satisfy the condition 
of Proposition 3 if the number of SR units in each assembly is greater than or 
equal to the set degree. 
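
The degree of a set of sequences can be computed in the same way as the degree 
of a single sequence, except that successors are collected across all sequences 
in the set. The function below is a sketch under that reading; its name is 
ours, each sequence is a string of single-character symbols beginning with its 
identifier, and occurrences of a context at the end of a sequence are ignored. 

```python
def set_degree(sequences):
    """Degree of a set of sequences, each starting with a unique identifier."""

    def followers(ctx):
        # Symbols that follow ctx anywhere in any sequence of the set.
        d = len(ctx)
        return {s[t + d] for s in sequences
                for t in range(len(s) - d) if s[t:t + d] == ctx}

    degree = 0
    for s in sequences:
        for i in range(1, len(s)):
            for d in range(1, i + 1):
                if followers(s[i - d:i]) == {s[i]}:
                    degree = max(degree, d)   # shortest context for s[i]
                    break
    return degree
```

Taken separately, X-A-B-C and Y-A-B-D each have degree 1. Taken together, the 
final components can only be told apart by the contexts X-A-B and Y-A-B, so the 
set has degree 3, which is smaller than the length of the longest sequence. 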

The sequential learning procedure is not necessary for the validity of 
Proposition 4. A more natural procedure of sequential training is to postpone 
retraining until interfered sequences need to be recalled in a specific 
application. This procedure is more consistent with the process of human 
learning. One often does not notice memory interference until being tested in a 
psychological experiment or daily life. This learning procedure blurs the 
difference between learning a new sequence for the first time and relearning an 
interfered sequence. Proposition 4 essentially implies that more and more 
sequences will be acquired by the system as the learning experience of the 
model extends. This is an important point. As a result, the anticipation model 
can be viewed as an open learning system. No rigid procedure for sequential 
training is needed for the system to increase its long-term memory capacity. 
The model automatically increases the capacity by just focusing on learning the 
current sequence. Also, the number of detectors needed to satisfy Proposition 4 
can be significantly smaller than the upper limit of 
$\sum_{i=1}^{k} (|S_i| - 1)$ for $k$ sequences. This is because detectors can 
be shared within the same sequence as well as across different sequences. This 
will be further discussed in the next section. 

4 Simulation Results 

To illustrate the model’s capability in learning an arbitrarily complex 
sequence, we show the following computer simulation for learning the input 
sequence <TO - BE - OR - NOT - TO - BE>. The simulated network has 24 detector 
units, 24 terminals, and 6 SR units for each shift-register assembly (144 SR 
units in total). Figure 2 shows the activity trace of the network from a 
simulation run. One dedicated symbol serves as the end marker, and another 
distinct symbol separates meaningful words. The network learned the sequence in 
5 training sweeps. In the last training sweep, the system correctly anticipates 
every component of the sequence, as shown in the last column of the figure. 
After this training, the entire sequence can be correctly generated by the 
presentation of its first component, T in this case, and the activity trace 
will be the same as the last sweep of training. The degree of the sequence is 
6, which was used to set r. 

Once one sequence is learned, a right subsequence of the sequence can be 
generated from a middle point of the sequence. In the above example, with 
symbol R as the initial input the network will correctly generate the remaining 
part of the sequence <-NOT - TO - BE>. Component R was chosen because it forms 
the degree 1 context for its successor. In general, a subsequence is required 
as input to generate the remaining part of the sequence. A subsequence can 
activate a detector, which then produces a certain component. The component can 
then join the subsequence to activate another detector, and so on, until the 
remaining part is fully generated. This feature of the model conforms with the 
experience that one can often continue a familiar song or a piece of music 
after being exposed to a part of it. 

Proposition 4 guarantees that the anticipation model does not suffer from 
catastrophic interference. Interference exists nonetheless in sequential 
training, because committed detectors may be seized by later training or 
retraining to make different anticipations. For example, assume that the system 
is sequentially trained with two simple sequences, Sa: C-A-T and Sb: E-A-R. 
After Sa is learned, the training with Sb will lead to the following situation. 
The previously established link from A to T will be replaced by a link from E-A 
to R. Thus, Sa is interfered and cannot be generated after Sb is acquired. The 
critical questions are what kind of interference the model exhibits and how 
severely it affects learning performance. The extent of interference depends on 
the amount of overlap between the sequence to be learned and the sequences 
already stored in the memory. Clearly, if a new sequence has no component in 
common with the stored sequences, the sequence can be trained as if nothing had 
been learned by the model. In this sense, interference is caused by the 
similarity between the sequence and the memory. This is consistent with 
psychological studies on retroactive interference (see Sect. 1.2). 

[Figure 2: activity traces of the input terminals; legend: Actual, Anticipated, Anticipated = Actual] 

Fig. 2. Training and generation of the sequence <TO-BE-OR-NOT-TO-BE>. The 
activity traces of the input terminals are shown, where a black box represents an actual 
terminal activity, a gray box represents an anticipated activity that does not match the 
actual input, and a white box represents a match between an anticipated and an actual 
input activity. The training phase takes 5 training sweeps. The generation process is 
initiated by presenting the first component of the sequence, T. The parameter values 
used are: α = 0.2, δ = 0.1, and C = 3.0. 

Knowing that the amount of interference, and thus retraining, depends on 
the overlap of the sequences to be learned, we evaluated the system on an 
arbitrarily selected domain that contains substantial overlap among the 
sequences. The database of sequences consists of the titles of all sessions 
that were held during the 1994 IEEE International Conference on Neural Networks 
(ICNN-94). This database has 97 sequences, as listed in Table 1. These titles 
are listed without any change and in exactly the same order as they appear in 
the final conference program, even retaining the obvious mistakes printed in 
the program. As is evident from the table, there are many overlapping 
subsequences within the database. Thus, these sequences provide a good testbed 
for evaluating sequential training and retroactive interference. 

For training with this database, the chosen network has 131 input terminals:
34 for the symbol set (26 English letters plus 8 other symbols, including the
space and “/”) and 97 for the identifiers of the 97 sequences. Each SR assembly
contains 40 units. Also, the network needs at least 1,088 detectors and 1,088
modulators. Hence, the network has a total of 7,567 units. The parameters of
the network are: α = 0.2, δ = 1/40, and C = 535. To measure the extent of
interference, we record the number of retraining sweeps required to eliminate
all interference for every round of sequential training. This number is a good
indicator of how much retraining is needed to store all of the sequences that
have been sequentially presented to the system. Also, we record the number of
intact (uninterfered) sequences right after the acquisition of the latest
sequence. Figure 3 shows the number of intact sequences and the number of
retraining sweeps plotted against training rounds.

Table 1. Sequence Base for Sequential Training

no.  Sequence
 1   Social & philosophical implications of computational intelligence
 2   Neurocontrol research: real-world perspectives
 3   Fuzzy neural systems
 4   Advanced analog neural networks and applications
 5   Neural networks for control
 6   Neural networks implementations
 7   Hybrid systems I.D.
 8   Artificial life
 9   Learning and recognition for intelligent control
10   Artificially intelligent neural networks
11   Hybrid systems II
12   Supervised learning X
13   Intelligent neural controllers: algorithms and applications
14   Who makes the rules?
15   Pulsed neural networks
16   Fuzzy neural systems II
17   Neural networks applications to estimation and identification
18   Adaptive resonance theory neural networks
19   Analog neural chips and machines
20   Learning and memory I
21   Pattern recognition I
22   Supervised learning I
23   Intelligent control I
24   Neurobiology
25   Cognitive science
26   Image processing III
27   Neural network implementation II
28   Applications of neural networks to power systems
29   Neural system hardware I
30   Time series prediction and analysis
31   Probabilistic neural networks and radial basis function networks
32   Pattern recognition II
33   Supervised learning II
34   Image processing I
35   Learning and memory II
36   Hybrid systems III
37   Artificially intelligent networks II
38   Fast learning for neural networks
39   Industry application of neural networks
40   Neural systems hardware II
41   Image processing II
42   Nonlinear PCA neural networks
43   Intelligent control III
44   Pattern recognition III
45   Supervised learning III
46   Applications in power
47   Time series prediction and analysis II
48   Learning and memory III
49   Intelligent robotics
50   Image recognition
51   Medical applications
52   Parallel architectures
53   Associative memory I
54   Pattern recognition IV
55   Supervised learning III
56   Learning and memory IV
57   Intelligent control IV
58   Economic/Finance/Business applications
59   Machine vision I
60   Machine vision II
61   Architecture I
62   Supervised learning V
63   Speech I
64   Robotics
65   Associative memory II
66   Medical applications II
67   Modular/Digital implementations
68   Pattern recognition VI
69   Robotics II
70   Unsupervised learning I
71   Optimization I
72   Applications in image recognition
73   Architecture III
74   Optimization II
75   Supervised learning VII
76   Associative memory IV
77   Robotics III
78   Speech III
79   Unsupervised learning II
80   Neurodynamics I
81   Applications I
82   Applied industrial manufacturing
83   Applications II
84   Architecture IV
85   Optimization III
86   Applications in image recognition II
87   Unsupervised learning III
88   Supervised learning VIII
89   Neurodynamics II
90   Computational intelligence
91   Optimization using Hopfield networks
92   Supervised learning IX
93   Applications to communications
94   Applications III
95   Unsupervised learning IV
96   Optimization IV
97   Applications




Fig. 3. The number of intact sequences and the number of retraining sweeps with
respect to training rounds during training with the Table 1 database.



Several conclusions can be drawn from this simulation result. The first and
most important conclusion is that the number of retraining sweeps seems
independent of the size of the previous memory. The overall curve for retraining
sweeps remains flat, even though there is large variation across different
training rounds. We view this result as particularly significant because it
suggests that, in the anticipation model, the amount of interference caused by
learning a new sequence does not increase with the number of previously
memorized sequences. This conclusion not only conforms intuitively with human
performance in long-term learning, but also makes it feasible for the model to
provide a reliable sequential memory that can be incrementally updated later on.
Hence, new items can be incorporated into the memory without being limited by
those items already in the memory. We refer to this property of sequential
learning/memory as capacity-independent incremental learning/memory. The
anticipation model exhibits this property because a learned sequence leaves its
traces across the network, involving a set of distributed associations between
subsequences and context detectors (see Fig. 1). On the other hand, each context
detector stores its context locally. When a new sequence is learned, it employs
a group of context detectors, some of which may have been committed, thus
causing interference. But as the sequential memory becomes large, so does the
number of context detectors. Out of these detectors, only a certain number will
be interfered as a result of learning a new sequence. The number of interfered
detectors tends to relate to the new sequence itself, not the size of the
sequential memory. In contrast to capacity-independent incremental learning,
catastrophic interference would require simultaneous retraining of the entire
memory when a new item is to be learned. The cost of retraining when
catastrophic interference occurs can be prohibitive unless the memory is very
small. Ruiz de Angulo and Torras (1995) presented a study on sequential learning
of multilayer perceptrons, and reported that their model can sequentially learn
several of the most recent patterns. Although this is a better result than that
of the original multilayer perceptrons, their model appears unable to support a
sizable memory. The high variations in the number of retraining sweeps are
caused by the overlaps between the stored sequences. For long overlapping
subsequences, many sweeps may be needed to resolve the interference caused by
overlaps.

The second conclusion is that the number of intact sequences increases
approximately linearly with rounds of sequential training. This is to be
expected given the result on the amount of retraining. Again, there are
considerable variations from one round to another. Since interference is caused
by the overlap between a new sequence and the stored sequences, another way of
looking at this result is the following. As the memory expands, a relatively
smaller fraction of the items in the memory overlaps with the new sequence.

Figure 4 illustrates the detailed retraining process during round 96. Right
after the model has learned S96 of Table 1, the only interfered sequence is S14.
After S14 is retrained, S96 is interfered and has to be retrained. This finishes
the first retraining cycle for round 96. During the second cycle, S71 is found
to be interfered. After S71 is retrained, S96 is interfered again and has to be
retrained. The retraining with S71 alternates with that of S96 for three more
cycles. Notice the large overlap between S71: “OPTIMIZATION I” and S96:
“OPTIMIZATION IV”. In cycle 6, several more sequences are interfered. After
retraining them sequentially, all of the first 96 sequences have been learned
successfully.
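The overlap that drives the alternating retraining of S71 and S96 can be made concrete with a small calculation. The following is only an illustrative sketch, not part of the anticipation model: the helper name `longest_common_substring` is hypothetical, and each character (including the space) is assumed to count as one sequence component.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return the longest contiguous run of components shared by a and b."""
    best_len, best_end = 0, 0
    # prev[j] holds the length of the common suffix of a[:i-1] and b[:j].
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


s71 = "OPTIMIZATION I"
s96 = "OPTIMIZATION IV"
overlap = longest_common_substring(s71, s96)
print(repr(overlap), len(overlap))
```

Under these assumptions the whole of S71 is contained in S96, which is consistent with the two sequences repeatedly interfering with each other.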

To examine the amount of detector sharing, we compare the detector use in
the anticipation model with one in which no detector is shared by different
components (see for example Wang & Arbib, 1993). Without detector sharing, the
number of detectors needed to acquire all of the 97 sequences is
$\sum_{i=1}^{97} (|S_i| - 1)$, where $|S_i|$ is the number of components of the
$i$-th sequence, which equals 2520. As stated earlier, the anticipation model
needs 1,088 detectors to learn all of the sequences. Thus, the detector sharing
in the anticipation model cuts required context detectors by a factor of about
2.3.
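The comparison above is simple arithmetic, sketched below. The function name `detectors_without_sharing` is hypothetical, and the counting rule it uses (one dedicated detector for every component after the first) is an assumption about the unshared scheme rather than something spelled out here. The figures 2520 and 1,088 are the reported values.

```python
def detectors_without_sharing(sequences):
    """Assumed rule: one dedicated detector per component after the first."""
    return sum(len(s) - 1 for s in sequences)


# Toy check on two titles from Table 1 (characters are taken as components).
toy = ["OPTIMIZATION I", "OPTIMIZATION IV"]
toy_count = detectors_without_sharing(toy)
print(toy_count)

# Ratio of the two detector counts reported in the text.
reported_unshared = 2520
reported_shared = 1088
saving = reported_unshared / reported_shared
print(round(saving, 2))
```

The ratio of the two reported counts comes out at roughly 2.3.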

5 Context Learning by Chunking 

Even though the amount of retraining does not depend on the number of stored 
sequences in the memory, retraining can still be expensive. As shown in Fig. 3, 
it usually takes dozens of retraining sweeps to fully incorporate a new sequence. 




Fig. 4. Retraining process during round 96 of Fig. 3. Each row represents the
activity trace of an input terminal indicated by the corresponding input symbol.
White boxes represent correctly anticipated terminal activities in sequence
generation, whereas gray boxes represent terminal activities which do not match
anticipated ones. Solid vertical lines separate training sweeps of different
sequences, and dashed vertical lines separate training sweeps of the same
sequence. Cycle numbers are indicated under each panel. Time runs from left to
right.



Our analysis of the model indicates that repeated sweeps are usually caused by
the need to commit a new detector and gradually expand the context of that
detector in order to resolve ambiguities caused by long recurring subsequences.
For example, consider the situation in which the system has stored the following
two sequences

Sc: “JOE LIKES 0”

Sd: “JAN LIKES 1”,

and the network is to learn the sequence

Se: “DEB LIKES 2”.

There is an overlapping subsequence between Sc and Sd: “LIKES”. A common
situation before Se is learned is that there are two detectors, say u1 and u2,
tuned to the contexts of “E LIKES” and “N LIKES”, respectively. While Se is
being trained, neither u1 nor u2 can be activated and a new detector, say u3,
must be committed to anticipate “2” (u3 needs to recognize only “S” since “S”
cannot activate either u1 or u2). Suppose now the network is to learn yet
another sequence,

Sf: “DIK LIKES 3”.

Se and Sf will take turns to capture u3 and gradually increase its degree until
u3 can detect either “B LIKES” or “K LIKES”. Eventually, another detector,
say u4, will be committed for the other sequence. This gradual process of degree
increment is the major factor causing numerous retraining sweeps.
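How long a context must become before it is unambiguous can be checked symbolically. The sketch below is not the network's learning rule. It only computes, for a given position, the shortest preceding context that is always followed by the same component across the stored sequences. The function name `min_degree` is hypothetical, and every character, including the space, is assumed to be one component, so the contexts it reports include the trailing space.

```python
def min_degree(sequences, k, t):
    """Shortest context length d such that the d components preceding
    position t of sequence k are followed by the same component wherever
    that context occurs in any stored sequence. Returns None if none exists."""
    target = sequences[k][t]
    for d in range(1, t + 1):
        context = sequences[k][t - d:t]
        followers = {
            s[j]
            for s in sequences
            for j in range(d, len(s))
            if s[j - d:j] == context
        }
        if followers == {target}:
            return d
    return None


stored = ["JOE LIKES 0", "JAN LIKES 1", "DEB LIKES 2", "DIK LIKES 3"]
degrees = []
for k, s in enumerate(stored):
    t = len(s) - 1                      # position of the final digit
    d = min_degree(stored, k, t)
    degrees.append(d)
    print(s[t], d, repr(s[t - d:t]))
```

Each final digit needs a context that reaches back past the shared subsequence to the one distinguishing letter. A detector that grows its degree by one per mismatch therefore needs many sweeps to get there.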

The above observation has led to the following extended model for cutting the
amount of retraining. The basic idea is to incorporate a chunking mechanism so
that newly committed detectors may expand their contexts from chunks formed
previously, instead of from scratch.

The extended model consists of a dual architecture, shown in Figure 5. The
dual architecture contains a generation network (on the left of Fig. 5), which is
almost the same as the original architecture (see Fig. 1), and another similar
network, called the chunking network (on the right of Fig. 5). The two networks
are mutually connected at the top. The chunking network does not produce
anticipation, and thus does not need a layer of modulators. Because of this, the
detectors in this network do not increase their degrees by a mismatch. Apart
from this difference, the chunking network mirrors every process occurring in
the generation network.

At any time step during training, there is a pair of winning detectors in the
dual architecture, one in each network. The algorithm is designed so that the
degree of the winning detector of the chunking network is 1 less than the degree
of the winning detector of the generation network. In addition, a newly
committed detector of the generation network may take a degree which is 1 plus
the degree of the activated chunk detector. We refer to the detectors of the
chunking network as chunk detectors. The interaction between the two networks
takes place via the two-way connections between them (Fig. 5). The introduction
of the chunk detectors can speed up learning when a subsequence (a chunk) occurs
multiple times in the input flow. More specifically, assume that a context
detector ui of degree d has learned to recognize a context, and a corresponding
chunk detector ûi of degree d - 1 has learned a chunk which is a right
subsequence of the context learned by ui. If the chunk occurs at least twice,
then there will be a time when ûi is activated but ui is not. Through the
learning process an uncommitted context detector, say uj, is activated (thus
committed). Instead of starting from degree 1, uj starts its degree at the value
of d, leading to a significant reduction of training/retraining sweeps. The
speedup in sequential training comes at a cost. Obviously, the addition of
another network, the chunking network, adds both to the size of the overall
network and to the computing time. The formal description of the chunking
network is given in Wang and Yuwono (1996).

Fig. 5. Dual architecture of the extended anticipation model for chunking. The
architecture consists of two mutually connected networks: the generation network
and the chunking network. The connections to and from global input and output
units have fixed weights. See the caption of Fig. 1 for other notations.

Before presenting simulation results with chunking, we explain using the
earlier example how the dual architecture helps speed up training. After training
with Sc and Sd, there will be a context detector tuned to either “E LIKES” or
“N LIKES”. This, in turn, will lead to the formation of the chunk “LIKES”;
that is, a chunk detector will be tuned to “LIKES” since the detector has a
degree that is one less than that of the corresponding context detector. After
the chunk is formed, Se can be acquired easily: the detector that is trained to
associate with “2” can obtain its appropriate degree in just one sweep, thanks
to the formation of the chunk “LIKES”. Similarly, Sf can be acquired quickly.

To show the effectiveness of the chunking network, we present simulations
using the same sequences (Table 1) and the same procedure as in the previous
section. To complete the training, the dual architecture requires 1,234 context
detectors, as compared to 1,088 without chunking, plus 436 chunk detectors.
Other parts of the network are the same as used in the previous section. Figure
6 gives the result. A comparison between Fig. 6 and Fig. 3 shows that the
former ends up with only a few more intact sequences. This indicates that, in the
dual architecture, later training causes almost the same amount of interference.
On the other hand, when chunking is incorporated, the number of retraining
sweeps on the whole is cut dramatically. The total number of retraining sweeps
during the entire training is 1,104 when the chunking network is included,
compared to 3,029 without the chunking network: a reduction of the overall
amount of sequential training by a factor of about 2.7.




Fig. 6. The number of intact sequences and the number of retraining sweeps with
respect to training rounds during training with the Table 1 database with the dual
architecture.



For comparison with Fig. 4, which illustrates the retraining process of round
96, Figure 7 shows the detailed retraining process during round 96 using
the dual architecture. Right after S96 is learned, three sequences are interfered:
S59, S60, and S86. After S59: “MACHINE VISION I” is retrained, S60: “MACHINE
VISION II” can be correctly generated without further retraining. This
interesting situation arises because the two sequences have a large overlap and
it is the overlapping part that is interfered during training with S96. Thus,
when the overlapping part is regained during retraining with S59, both S59 and
S60 can be recalled correctly. The system needs another sweep to regain S86.
After the first retraining cycle, all of the first 96 sequences have been
acquired.

Fig. 7. Retraining process during round 96 of Fig. 6. The interference with S60
is eliminated as a result of retraining S59. See the caption of Fig. 4 for
notations.



6 Further Discussion 

The idea of anticipation-based learning seems to be consistent with psychological
evidence on human learning of sequential behaviors. Meyer (1956) noted that
expectation is key to music cognition. When a temporal sequence is repeatedly
presented to subjects, according to Nissen and Bullemer (1987), the reaction to
a particular component in the sequence becomes faster and faster, and the
reaction time to a component in a repeated sequence is much shorter than when it
occurs in random sequences. The latter finding rules out the possibility that the
reduction in reaction time is due to the familiarity with a component. These
findings have been confirmed by later experiments (Willingham et al., 1989;
Cohen et al., 1990). The results suggest that the subjects have developed with
practice some form of anticipation before a particular component actually occurs
in the sequence. It is observed that in learning temporal sequences human
subjects can even be explicitly aware of the temporal structure of a sequence,
and predict what comes next in the sequence (Nissen & Bullemer, 1987; Willingham
et al., 1989; Curran & Keele, 1993).

As analyzed earlier, the anticipation model does not suffer from catastrophic
interference. When multiple sequences are presented to the model sequentially,
some degree of interference occurs. But this kind of interference can be overcome
by retraining the interfered sequences. Extensive computer simulations indicate
that the amount of retraining does not increase as the number of sequences stored
in the model increases. The anticipation model thus exhibits
capacity-independent incremental learning during sequential training. These
results, plus the fact that interference is caused by the overlap between a new
sequence and stored sequences, suggest that the behavior of the model in
sequential learning resembles aspects of retroactive interference.

After a sequence S is learned, it can be generated from its beginning component,
the sequence identifier. Partial sequence generation can also be elicited by a
subsequence of S. If a sufficient subsequence is presented, the rest of S can be
generated entirely. Partial generation may stop before the rest of the sequence
is completed. For example, after the sequence X-A-B-C-D-E-A-B-C-D-F is
learned, the presentation of A activates the subsequence B-C-D, but not the
rest. The anticipation model exhibits partial generation because a sequence is
stored as a chain of associations, each of which is triggered by a context, or a
subsequence. This property of the model is consistent with our experience that
we are able to pick up a familiar song, a melody, or an action sequence (like Tai
Chi) from the middle.
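The partial-generation example can be reproduced with a purely symbolic stand-in for the chain of context-triggered associations. This is a minimal sketch under stated assumptions, not the network itself. The names `learn_contexts` and `generate` are hypothetical. Each association is assumed to be keyed by the shortest context that is unambiguous within the sequence, and generation stops as soon as no stored context matches the recent history.

```python
def learn_contexts(seq):
    """Map each shortest unambiguous context (a tuple) to its next component."""
    contexts = {}
    for t in range(1, len(seq)):
        for d in range(1, t + 1):
            ctx = tuple(seq[t - d:t])
            followers = {
                seq[j] for j in range(d, len(seq)) if tuple(seq[j - d:j]) == ctx
            }
            if followers == {seq[t]}:
                contexts[ctx] = seq[t]
                break
    return contexts


def generate(contexts, cue, max_steps=50):
    """Extend the cue one component at a time while some context matches."""
    history = list(cue)
    produced = []
    for _ in range(max_steps):
        nxt = None
        for ctx, component in contexts.items():
            if len(ctx) <= len(history) and tuple(history[-len(ctx):]) == ctx:
                nxt = component
                break
        if nxt is None:
            break                       # no association is triggered: stop
        history.append(nxt)
        produced.append(nxt)
    return produced


seq = "X-A-B-C-D-E-A-B-C-D-F".split("-")
contexts = learn_contexts(seq)
print(generate(contexts, ["A"]))        # partial generation from the middle
print(generate(contexts, ["X"]))        # generation from the identifier
```

In this sketch, cueing with A yields B, C, D and then halts, because the component after D is only determined by a context reaching back to X or E. Cueing with the identifier X reproduces the remainder of the sequence.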

As shown in Sect. 5, the model’s capability of chunking repeated subsequences
within a sequence and between sequences substantially reduces the amount of
retraining and improves the overall efficiency of learning. Without chunking,
recurring subsequences must be learned from scratch. The basic idea behind
the current form of chunking is to learn a recurring subsequence just once and
store it as a chunk, so that the next time the subsequence occurs the model can
simply use the chunk as a basic component. Chunking is a fundamental
characteristic of human information processing (Miller, 1956; Simon, 1974). We
note that, though the present model has addressed some aspects of chunking, the
general issue of automatic chunking is very challenging and remains an open
problem. What constitutes a chunk? A chunk is often taken to be a meaningful
subsequence (such as a word), but it may also be just a convenient way of
breaking a long sequence into shorter subsequences to facilitate further
processing. In the anticipation model, a chunk corresponds to a repeated
subsequence. This is a reasonable definition in the present context. The model,
through its mechanism of context learning, provides a neural network basis for
forming such chunks. On the other hand, this definition of a chunk does not
capture the richness of general chunking. Chunking depends critically on the STM
capacity (Miller, 1956). Furthermore, different people may have different ways
of chunking the same sequence in order to overcome STM limitations and memorize
the sequence. Chunking also depends on general knowledge and custom. For
example, we tend to chunk a 10-digit telephone number in the U.S. into three
chunks: the first three digits corresponding to an area code, then the next
three digits to a district code, and the last four digits. However, the same
10-digit number may well be chunked differently in another country.

Proposition 4 and the property of capacity-independent incremental learning
together enable the anticipation model to perform long-term automatic learning
of temporal sequences. The system is both adaptive and stable, and its long-term
memory capacity increases gradually as learning proceeds. Thus, the anticipation
model provides a sequential memory, which can store and recall a large number
of complex sequences.



7 Summary 

In this chapter, we have presented the anticipation model, a neural network
model that learns and generates complex temporal sequences. In the anticipation
model, sequences are acquired by one-shot learning that obeys a normalized
Hebbian learning rule, in combination with a competitive mechanism realized
by a winner-take-all network. During learning and generation, the network
actively anticipates the next component on the basis of a previously anticipated
context. A mismatch between the anticipation and the actual input triggers
self-organization of context expansion. Analytical results on the anticipation
model, presented in Sect. 3, ensure that the model can learn to generate any
complex sequence. Multiple sequences are acquired by the model in an incremental
fashion, and large-scale simulation results strongly suggest that the model
exhibits capacity-independent incremental learning. As a result, the
anticipation model provides an effective sequential memory. In addition, by
incorporating a form of chunking we have demonstrated significant performance
improvement in learning many sequences that have significant overlaps.

Finally, the anticipation model argues (see also Wang & Arbib, 1993) from a
computational perspective for the chaining theory of temporal behavior, which
was rejected by Lashley (1951) but supported by recent psychological studies of
serial order organization (Murdock, 1987; Lewandowsky & Murdock Jr., 1989).
Simple associative chaining between adjacent sequence components is too limited
to be true. However, if chaining between remote components and chunking
of subsequences into high-order components are allowed, the basic idea of
associative chaining can give rise to much more complex temporal behaviors,
going much beyond what was realized by Lashley (1951). The anticipation model
shows how learning and generation of complex temporal sequences can be achieved
by self-organization in a neural network.



1 Introduction

Connectionist models for learning in sequential domains are typically dynamical
systems that use hidden states to store contextual information. In principle,
these models can adapt to variable time lags and perform complex sequential
mappings. In spite of several successful applications (mostly based on hidden
Markov models), the general class of sequence learning problems is still far from
being satisfactorily solved. In particular, learning sequential translations is
generally a hard task and current models seem to exhibit a number of limitations.
One of these limitations, at least for some application domains, is the causality
assumption. A dynamical system is said to be causal if the output at (discrete)
time t does not depend on future inputs. Causality is easy to justify in
dynamics that attempt to model the behavior of many physical systems. Clearly, in
these cases the response at time t cannot depend on stimuli that the system
has not yet received as input. As it turns out, non-causal dynamics over infinite
time horizons cannot be realized by any physical or computational device. For
certain categories of finite sequences, however, information from both the past
and the future can be very useful for analysis and predictions at time t. This is
the case, for example, of DNA and protein sequences where the structure and
function of a region in the sequence may strongly depend on events located both
upstream and downstream of the region, sometimes at considerable distances.
Another good example is provided by the off-line translation of one language into
another. Even in the so-called “simultaneous” translation, it is well known
that interpreters are constantly forced to introduce small delays in order to
acquire “future” information within a sentence to resolve semantic ambiguities
and preserve syntactic correctness.

Non-causal dynamics are sometimes used in other disciplines (for example,
Kalman smoothing in optimal control or non-causal digital filters in signal
processing). However, as far as connectionist models are concerned, the causality
assumption is shared among all the types of models which are capable of mapping
input sequences to output sequences, including recurrent neural networks
and input-output HMMs (IOHMMs) (Bengio & Frasconi, 1996). In this paper,
we develop a new family of non-causal adaptive architectures where the
underlying dynamics are factored using a pair of chained hidden state variables.
The two chains store contextual information contained in the upstream and
downstream portions of the sequence, respectively. The output at time t is then
obtained by combining the two hidden representations. Interestingly, the same
general methodology can be applied to many different classes of graphical models
for time series, such as recurrent neural networks, IOHMMs, tree structured
HMMs, and switching state space models (Ghahramani & Jordan, 1997). For
concreteness, however, in the rest of this paper we focus exclusively on IOHMMs
and recurrent neural networks.
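The idea of two chained hidden states can be illustrated with a deliberately tiny example. The sketch below is not the architecture developed in this chapter. It uses scalar states, a tanh update, and a simple sum to combine the two chains, and the function name `bidirectional_pass` and all weight values are assumptions chosen only for illustration.

```python
import math


def bidirectional_pass(xs, wf=0.5, uf=0.8, wb=0.5, ub=0.8):
    """Return (forward, backward, outputs) for a scalar input sequence xs.

    forward[t] summarizes inputs up to and including t (upstream context);
    backward[t] summarizes inputs from t to the end (downstream context).
    """
    n = len(xs)
    forward = [0.0] * n
    backward = [0.0] * n

    state = 0.0
    for t in range(n):                      # left-to-right chain
        state = math.tanh(wf * xs[t] + uf * state)
        forward[t] = state

    state = 0.0
    for t in range(n - 1, -1, -1):          # right-to-left chain
        state = math.tanh(wb * xs[t] + ub * state)
        backward[t] = state

    outputs = [f + b for f, b in zip(forward, backward)]
    return forward, backward, outputs


f1, b1, y1 = bidirectional_pass([1.0, 0.0, 0.0, 0.0])
f2, b2, y2 = bidirectional_pass([1.0, 0.0, 0.0, 1.0])
print(f1[0] == f2[0])   # the forward chain at t = 0 ignores the future input
print(y1[0] != y2[0])   # the combined output at t = 0 depends on it
```

Changing only the last input leaves the forward state at the first position untouched but alters the backward state there, so the combined output at that position is non-causal.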

The main motivation of this work is an application to the problem of protein
secondary structure (SS) prediction in molecular biology. The task can be
formulated as the translation of amino acid input strings into corresponding
output strings that describe an approximation of the proteins’ 3D folding. This
is a classic problem in bioinformatics which has been investigated for several
years from a machine learning perspective, and for which significant performance
improvements are still expected. Protein SS prediction can be formulated as the
problem of learning a synchronous sequential translation, from strings in the
amino acid alphabet to strings in the SS alphabet. The task is thus a special
form of grammatical inference (Angluin & Smith, 1983). Nonetheless, to the best
of our knowledge no successful applications of grammatical inference algorithms
(neither symbolic nor based on connectionist architectures) have been reported.
Instead, the current best predictors are based on feedforward neural networks
fed by a fixed-width window of amino acids, which by construction cannot capture
relevant information contained in distant regions of the protein. Our proposal
is motivated by the assumption that both adaptive dynamics and non-causal
processing are needed to overcome the drawbacks of local fixed-window
approaches. While our current system achieves an overall performance exceeding
75% correct prediction (at least comparable to the best existing systems), the
main emphasis here is on the development of new algorithmic ideas.
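The limitation of fixed-width windows is easy to see once the input construction is written out. The following is a minimal sketch, assuming a string of one-letter amino acid codes. The function name `fixed_windows`, the padding symbol, and the default width of 13 are choices made for this illustration only.

```python
def fixed_windows(sequence, width=13, pad="-"):
    """Return one window per residue, centered on that residue.

    Positions that fall outside the sequence are filled with `pad`.
    """
    if width % 2 == 0:
        raise ValueError("width must be odd so that the window has a center")
    half = width // 2
    padded = pad * half + sequence + pad * half
    return [padded[i:i + width] for i in range(len(sequence))]


demo = fixed_windows("ACDEFG", width=3)
print(demo)

protein = "MKTAYIAKQRQISFVKSHFSRQ"     # arbitrary example string
windows = fixed_windows(protein)       # default width 13
print(len(windows), windows[0])
```

A predictor fed with such windows sees at most `width // 2` residues on either side of the center, so any residue farther away cannot influence the prediction at that position.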

The chapter is organized as follows. In Section 2, we briefly review the
literature on protein SS prediction. In Sections 3 and 4, we introduce the two
novel non-causal architectures: bidirectional IOHMMs (BIOHMMs) and bidirectional
RNNs (BRNNs). In Section 5, we describe the protein datasets used in the
experimental evaluation of the proposed system. Finally, in Section 6 we report
preliminary prediction results on the SS prediction task using our best system,
which is based on ensembles of BRNNs.



2 Prediction of Protein Secondary Structure 

Proteins are polypeptide chains carrying out most of the basic functions of life
at the molecular level. The chains can be viewed as linear sequences over the
20-letter amino acid alphabet that fold into complex 3D structures essential to
their function. One step towards predicting how a protein folds is the prediction
of its secondary structure. The secondary structure consists of local folding
regularities, often maintained by hydrogen bonds, and is traditionally subdivided into
three classes: alpha helices, beta sheets, and coils (representing all the rest). The
sequence preferences and correlations involved in these structures have made
secondary structure prediction one of the classical problems in computational
molecular biology. Moreover, this is one application where machine learning
methods, particularly neural networks, have had considerable impact, yielding the
best performing algorithms to date (Rost & Sander, 1994).

The basic architecture used in the early work of Qian and Sejnowski (1988) is 
a fully connected MLP with a single hidden layer that takes as input a local fixed- 
size window of amino acids (the typical width is 13), centered around the residue 
for which the secondary structure is being predicted. A significant improvement 
was obtained by cascading the previous architecture with a second network to 
clean up the output of the lower network. The cascaded architecture reached a
performance of Q3 = 64.3%, with the correlations Cα = 0.41, Cβ = 0.31, and
Cγ = 0.41; see (Baldi et al., 1999) for a review of the standard performance
measures used in this chapter. Although this approach has proven to be quite
successful, using a local fixed-width window has well-known drawbacks. First,
the size of the input window must be chosen a priori and a fair choice may be
difficult. Second, the number of parameters grows with the window size. This
means that permitting certain far away inputs to exert an effect on the current
prediction is paid for in terms of parametric complexity. Hence, one of the main
dangers of Qian and Sejnowski’s architecture is the overfitting problem.
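To make the window-size trade-off concrete, the following minimal sketch one-hot encodes a fixed-width window and counts the weights of a single-hidden-layer MLP on top of it. The hidden layer size of 40, the zero padding at chain ends, and the function names are illustrative assumptions, not details taken from the work reviewed here.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20-letter alphabet
AA_INDEX = {a: i for i, a in enumerate(AMINO_ACIDS)}

def encode_window(seq, center, width=13):
    """One-hot encode `width` residues centered at `center`.
    Positions beyond either end of the chain stay all-zero (an assumption;
    an extra 'spacer' symbol is another common choice)."""
    half = width // 2
    x = np.zeros((width, len(AMINO_ACIDS)))
    for k, pos in enumerate(range(center - half, center + half + 1)):
        if 0 <= pos < len(seq):
            x[k, AA_INDEX[seq[pos]]] = 1.0
    return x.ravel()

def mlp_param_count(width, hidden, classes=3, alphabet=20):
    """Weights plus biases of a single-hidden-layer MLP fed by the window."""
    n_in = width * alphabet
    return (n_in + 1) * hidden + (hidden + 1) * classes

# Roughly doubling the window roughly doubles the number of parameters.
print(mlp_param_count(13, 40), mlp_param_count(27, 40))
```

With these hypothetical sizes, almost all parameters sit in the input-to-hidden layer, which is why widening the window to reach distant residues is expensive.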

Most of the subsequent work on predicting protein secondary structure using
NNs has been based on architectures with a local window, although considerable
effort has gone into devising improvements. Rost and Sander (1993b,
1993a) started with Qian and Sejnowski’s architecture, but used two methods 
to address the overfitting problem. First, they used early stopping. Second, they 
used ensemble averages (Hansen & Salamon, 1990; Krogh & Vedelsby, 1995) by 
training different networks independently, using different input information and 
learning procedures. But the most significant new aspect of their work is the 
use of multiple alignments, in the sense that profiles (i.e. position-dependent 
frequency vectors derived from multiple alignments), rather than raw amino 
acid sequences, are used in the network input. The reasoning behind this is that 
multiple alignments contain more information about secondary structure than 
do single sequences, the secondary structure being considerably more conserved 
than the primary sequence. Although tests made on different data sets can be 
hard to compare, the method of Rost and Sander, which resulted in the PHD 
prediction server (Rost & Sander, 1993b, 1993a, 1994), still reaches the top levels
of prediction accuracy (Q3 = 72%, measured using 7-fold cross validation). In
the 1996 Asilomar competition CASP2, the PHD method reached Q3 = 74%
accuracy, thus performing much better than virtually all other methods used for 
making predictions of secondary structure. 
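As an illustration of what a profile is, the sketch below computes position-dependent frequency vectors from a toy multiple alignment. It uses plain unweighted counts and ignores gaps, which is a simplifying assumption; practical systems typically weight the aligned sequences.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {a: i for i, a in enumerate(AMINO_ACIDS)}

def alignment_profile(aligned_seqs):
    """Return a (length, 20) array of per-column amino acid frequencies.
    Gap characters and unknown letters are skipped; a column without any
    valid residue is left all-zero."""
    length = len(aligned_seqs[0])
    counts = np.zeros((length, len(AMINO_ACIDS)))
    for s in aligned_seqs:
        for pos, a in enumerate(s):
            if a in AA_INDEX:
                counts[pos, AA_INDEX[a]] += 1
    totals = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)

profile = alignment_profile(["ACD", "AC-", "GCD"])
print(profile[0, AA_INDEX["A"]])  # frequency of A in the first column
```

Each row of the result replaces the one-hot vector of a single residue in the network input.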

Another interesting recent NN approach is the work of Riis and Krogh (1996), 
who address the overfitting problem by careful design of the NN architecture. 
Their approach has four main components. First, they reduce the number of free 
parameters by using an adaptive encoding of amino acids, that is, by letting the 
NN find an optimal and compressed representation of the input letters. Second,
the authors design a different network for each of the three classes, using
biological prior knowledge. For example, in the case of alpha-helices, they exploit
the helix periodicity by building a three-residue periodicity between the first 
and second hidden layers. Third, Riis and Krogh use ensembles of networks and 
filtering to improve the prediction. The networks in each ensemble differ, for 
instance, in the number of hidden units used. Finally, the authors use multi- 
ple alignments together with a weighting scheme. Instead of profiles, for which 
the correlations between amino acids in the window are lost, predictions are 
made first from single sequences and then combined using multiple alignments. 
Most important, perhaps, the basic accuracy achieved is Q3 = 66.3% when using
seven-fold cross-validation on the same database of 126 non-homologous proteins
used by Rost and Sander. In combination with multiple alignments, the method
reaches an overall accuracy of Q3 = 71.3%, and correlation coefficients
Cα = 0.59, Cβ = 0.50, and Cγ = 0.41. Thus, in spite of a considerable
amount of architectural design, the final performance is practically identical to
that of (Rost & Sander, 1994). More recently, Cuff and Barton (1999) have compared
and combined the main existing predictors. On the particular data sets used in
their study, the best isolated predictor is still PHD with Q3 = 71.9%.

A more detailed review of the secondary structure prediction problem and
corresponding results can be found in (Baldi & Brunak, 1998). The key
point, however, is that there is an emerging consensus on an accuracy upper
bound, slightly above 70-75%, for any prediction method based on local
information only. By leveraging evolutionary information in the form of multiple
sequence alignments, performance seems to top out at the 72-74% level, in spite of
several attempts with sophisticated architectures. Thus it appears today that
to further improve prediction results one must use distant information, in
sequences and alignments, which is not contained in local input windows. This
is particularly clear in the case of beta sheets where stabilizing bonds can be
formed between amino acids far apart. Using long-ranged information, however,
poses two formidable related challenges: (1) avoiding the overfitting associated with
large-input-window MLPs, and (2) being able to detect the sparse and weak long-ranged
signal and combine it with the significant local information, while ignoring the
bulk of less relevant distant information.

The limitations associated with the fixed-size window approach can be
mitigated using other connectionist models for learning sequential translators, such
as recurrent neural networks (RNNs) or input-output hidden Markov models
(IOHMMs). Unlike feedforward nets, these models employ state dynamics to
store contextual information and they can adapt to variable width temporal
dependencies. Unfortunately there are theoretical reasons suggesting that,
despite an adequate representational power, RNNs cannot possibly learn to
capture long-ranged information because of the vanishing gradient problem (Bengio
et al., 1994). However, it is reasonable to believe that RNNs fed by a small
window of amino acids can capture some distant information using fewer adjustable
weights than MLPs. The usual definition of RNNs only allows “past” context
to be used but, as it turns out, useful information for prediction is located both
downstream and upstream of a given residue. The architectures described in the
next sections remove these limitations.

3 IOHMMs and Bidirectional IOHMMs

3.1 Markovian Models for Sequence Processing 

Hidden Markov models (HMMs) were introduced several years ago as a
tool for probabilistic sequence modeling. The interest in this area developed
particularly in the Seventies, within the speech recognition research community,
concerned at that time with the limitations of template-based approaches such
as dynamic time-warping. The basic model was very simple, yet so flexible and
effective that it rapidly became extremely popular. In recent years a large
number of variants and improvements over the standard HMM have been
proposed and applied. Undoubtedly, Markovian models are now regarded as one of
the most significant state-of-the-art approaches for sequence learning. Besides
speech recognition (see e.g. (Jelinek, 1997) for more recent advances),
important applications of HMMs include sequence analysis in molecular biology (Baldi
et al., 1994; Krogh et al., 1994; Baldi & Chauvin, 1996; Baldi & Brunak, 1998),
time series prediction (Andrew & Dimitriadis, 1994), numerous pattern
recognition problems such as handwriting recognition (Bengio et al., 1995; Bunke et al.,
1995), and, more recently, information extraction (Freitag & McCallum, 2000).
HMMs are also very closely related to stochastic regular languages, making them
interesting in statistical natural language processing (Charniak, 1993). The
recent view of the HMM as a particular case of Bayesian networks (Bengio &
Frasconi, 1995; Lucke, 1995; Smyth et al., 1997) has helped the theoretical
understanding and the ability to conceive extensions to the standard model in a
sound and formally elegant framework.

The basic data object being considered by the standard model is limited 
to a single sequence of observations (which may be discrete or numerical, and 
possibly multivariate). Internally, standard HMMs contain a single hidden state 
variable $X_t$ which is repeated in time to form a Markov chain (see Figure 1). The
Markov property states that $X_{t+1}$ is conditionally independent of $X_1, \ldots, X_{t-1}$
given $X_t$. Similarly, each emission variable $Y_t$ is independent of the rest given
$X_t$. These conditional independence assumptions are graphically depicted using
the Bayesian network of Figure 1. More details about the basic model can be 
found in (Rabiner, 1989). 
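As a worked equation, the two independence assumptions above imply the usual factorization of the joint distribution of a standard HMM over $T$ steps. Writing the initial state distribution as $P(X_1 \mid X_0)$ is a shorthand adopted only for this sketch:

$$P(X_1, \ldots, X_T, Y_1, \ldots, Y_T) = \prod_{t=1}^{T} P(X_t \mid X_{t-1})\, P(Y_t \mid X_t).$$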

Some variants of standard HMMs are conceived as extensions of the internal 
structure of the model. For example, factorial HMMs (Ghahramani & Jordan, 
1997) contain more than just one hidden state variable (see for example the 
middle of Figure 1) and can also introduce complex probabilistic relationships 
amongst state variables. However, the data interface towards the external world
essentially remains the same and the basic data objects being processed are still
single sequences.

[Figure 1 panels: Standard HMM; Factorial HMM (with 2 state variables)]

Fig. 1. Bayesian networks for standard, factorial, and input-output HMMs.

Another direction that can be explored to extend the data types that can
be dealt with by Markovian models is the direction of modeling. A standard HMM
is a generative model. When used in classification tasks (as in isolated word
recognition), one model per class is introduced and each model is typically trai- 
ned on positive examples only. This means that no competition among classes is 
introduced when estimating the parameters and there is no associative probabi- 
listic mapping from a sequence to a class variable. Learning is thus unsupervised 
because the model is trained to estimate an unconditional probability density 
(in a sense, we might say that a weak form of supervision occurs because class 
membership of training examples is only used to select the model to which the 
sequence is presented but each model is not aware of other classes). A fully
supervised method for training HMMs was proposed early on by Brown (1987). The
method modifies the function to be optimized and, instead of using maximum
likelihood estimation, it relies on maximum mutual information (MMI) between
the model and the class variable. In this way, the set of models employed in
a classification task makes it possible to estimate the conditional probability of the class
given the input sequence, thus effectively introducing supervision in the learning
process, as noted by Bridle (1989) and Bengio & Frasconi (1994). However, the
MMI approach only allows a single output variable to be associated with the input
sequence. A more general supervised learning problem for sequences consists of
associating a whole output sequence with the input sequence. This is the common
sequence learning setting when using other machine learning approaches, such
as recurrent neural networks. This extension cannot be achieved by a modification
of the optimization procedure but requires a different architecture, called
input-output HMM (IOHMM).

As shown at the bottom of Figure 1, three classes of variables are considered.
The emission $Y_t$ is called the output variable in this case. Hidden states in IOHMMs
are conditionally dependent on the input variable $U_t$. In this way, state transition
probabilities, although stationary, are non-homogeneous because the probability
of transition is controlled by the current input. The resulting model is thus
similar to a stochastic translating automaton (i.e. it can be used to estimate the
conditional probability of the output sequence given the input sequence), but
IOHMMs can also deal with continuous input and output variables. This can
be easily achieved by employing feedforward neural networks as models of the
conditional probabilities $P(Y_t \mid X_t, U_t)$ and $P(X_t \mid X_{t-1}, U_t)$ that define the
parameterization of the model. This technique is described in detail in (Baldi &
Chauvin, 1996) and in (Bengio & Frasconi, 1996), along with details about
inference and learning algorithms. These topics are summarized later on in Sections
3.3 and 3.4 for the more general case of the bidirectional architecture presented
in this paper.
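The way an IOHMM estimates the conditional probability of an output sequence given an input sequence can be sketched with a forward recursion over input-conditioned tables. This is a minimal illustration for the discrete case only; the table layout, the array names, and the use of explicit tables instead of neural networks are assumptions made for the example.

```python
import numpy as np

def iohmm_likelihood(trans, emit, init, u, y):
    """Return P(y | u) for a discrete IOHMM.
    trans[k, i, j] = P(X_t = j | X_{t-1} = i, U_t = k)
    emit[k, j, o]  = P(Y_t = o | X_t = j, U_t = k)
    init[i]        = P(X_0 = i), the state before the first input
    """
    alpha = init.copy()
    for u_t, y_t in zip(u, y):
        # propagate with the transition table selected by the current input,
        # then weight each state by the probability of the observed output
        alpha = (alpha @ trans[u_t]) * emit[u_t][:, y_t]
    return alpha.sum()
```

Because the transition table is indexed by the current input symbol, the chain is non-homogeneous even though the tables themselves do not change over time.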

3.2 The Bidirectional Architecture 

A bidirectional IOHMM is a non-causal model of a stochastic translation defined
on a space of finite sequences. Like IOHMMs, the model describes the conditional
probability distribution $P(Y \mid U)$, where $U = U_1, U_2, \ldots, U_T$ is a generic input
sequence and $Y = Y_1, Y_2, \ldots, Y_T$ the corresponding output sequence. Although
in the protein application described below both $U$ and $Y$ are symbolic sequences,
the theory holds for sequences of continuous (possibly multivariate) variables
as well. The model is based on two Markov sequences of hidden state variables,
denoted by $F$ and $B$, respectively. For each time step $t$, $F_t$ and $B_t$ are discrete
variables with realizations (states) in $\{f^1, \ldots, f^n\}$ and $\{b^1, \ldots, b^m\}$, respectively.
As in HMMs, $F_t$ is assumed to have a causal impact on the next state $F_{t+1}$.
Hence, $F_t$ stores contextual information contained on the left of $t$ (propagated
in the forward direction). Symmetrically, $B_t$ is assumed to have a causal impact
on the state $B_{t-1}$, thus summarizing information contained on the right of $t$
(propagated in the backward direction).

As in other Markov models for sequences, several conditional independence 
assumptions are made, and can be described by a Bayesian network as shown in 
Fig. 2. In particular, the following factorization of the joint distribution holds: 

$$P(Y, U, F, B) = \prod_{t=1}^{T} P(Y_t \mid F_t, B_t, U_t)\, P(F_t \mid F_{t-1}, U_t)\, P(B_t \mid B_{t+1}, U_t)\, P(U_t) \tag{1}$$

Two boundary variables, $B_{T+1}$ and $F_0$, are needed to complete the definition of
the model. For simplicity we assume these variables are given, i.e.
$P(B_{T+1} = b^1) = P(F_0 = f^1) = 1$, although generic (trainable) distributions could be
specified. The suggested architectures can be viewed as a special form of factorial
IOHMMs (with obvious relationships to factorial HMMs and hidden Markov
decision trees (Ghahramani & Jordan, 1997)) where the state space is factorized
into the state variables $F_t$ and $B_t$.

Fig. 2. Bayesian network for the bidirectional IOHMM.
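The factorization of the bidirectional model can be checked numerically on a toy example. The sketch below evaluates the log of the product for one sequence and one choice of hidden state paths, dropping the input term $P(U_t)$ because inputs are observed. The table layout and argument names are assumptions made for illustration.

```python
import numpy as np

def biohmm_log_joint(p_y, p_f, p_b, u, y, f, b, f_start=0, b_end=0):
    """log P(Y, F, B | U) for one sequence of length T.
    p_y[k, j, l, i] = P(Y_t = i | F_t = j, B_t = l, U_t = k)
    p_f[k, a, j]    = P(F_t = j | F_{t-1} = a, U_t = k)
    p_b[k, a, l]    = P(B_t = l | B_{t+1} = a, U_t = k)
    f_start and b_end are the fixed boundary states F_0 and B_{T+1}."""
    T = len(u)
    f_prev = [f_start] + list(f[:-1])   # F_{t-1} for each position t
    b_next = list(b[1:]) + [b_end]      # B_{t+1} for each position t
    logp = 0.0
    for t in range(T):
        logp += np.log(p_y[u[t], f[t], b[t], y[t]])
        logp += np.log(p_f[u[t], f_prev[t], f[t]])
        logp += np.log(p_b[u[t], b_next[t], b[t]])
    return logp
```

Summing the exponential of this quantity over all output and hidden state sequences gives 1 for any fixed input, which is a quick sanity check that the three tables define a proper conditional distribution.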



3.3 Parameterization 

The parameters of a Bayesian network specify the local conditional distribution
of each variable given its parents. In the case of BIOHMMs, the local conditional
distributions are $P(Y_t \mid F_t, B_t, U_t)$, $P(F_t \mid F_{t-1}, U_t)$, and $P(B_t \mid B_{t+1}, U_t)$.
Unconditional distributions for root nodes (i.e. $P(U_t)$) do not need to be modeled if we
assume that there are no missing data in the input sequences. A quite common
simplification is to assume that the model is stationary, i.e. the above
conditional distributions do not vary over time. Stationarity can be seen as a particular
form of parameter sharing that significantly reduces the degrees of freedom of
the model. In the discrete case, parameters can be explicitly represented using
conditional probability tables. Unfortunately the tables can become very large
when nodes have many parents, or variables have large state spaces. Hence,
a more constrained reparameterization is often desirable and can be achieved
using neural network techniques. In (Baldi & Chauvin, 1996), the general
approach is demonstrated in the context of HMMs for protein families using, for
the emission probabilities, a single hidden layer shared across all HMM states.
In the case of BIOHMMs, the approach can be extended by introducing three
separate feedforward neural networks for modeling the local conditional
probabilities $P(B_t \mid B_{t+1}, U_t)$, $P(F_t \mid F_{t-1}, U_t)$, and $P(Y_t \mid F_t, B_t, U_t)$. Alternatively, a modular
approach using a different MLP for each state can be pursued (Baldi et al.,
1994; Bengio & Frasconi, 1996). The modular approach is also possible with
BIOHMMs, although in this case the number of subnetworks would become
$n + m + nm$ (one subnetwork for each state $b^k$, one for each state $f^j$, and one
for each pair $(b^k, f^j)$).

3.4 Inference and Learning

The basic theory for inference and learning in graphical models is well establis- 
hed, and can readily be applied to the present architecture. For conciseness, 
we focus on the main aspects only. A major difference between BIOHMMs 
and lOHMMs or HMMs is that the Bayesian network for BIOHMMs is not 
singly connected. Hence direct propagation algorithms such as Pearl’s algorithm 
(1988) cannot be used for solving the inference problem. Rather, we adopt the
general junction tree algorithm (Jensen et al., 1990). Given the regular
structure of the network, the junction tree can be constructed by hand. Cliques are
$\{U_t, F_t, B_t, Y_t\}$, $\{U_t, F_t, B_t, F_{t-1}\}$, and $\{U_t, F_t, B_t, B_{t+1}\}$. Assuming $B_t$ and $F_t$
have the same number of states (i.e., $n = m$), space and time complexities are
$O(KTn^3)$, where $K$ is the number of input symbols ($K = 20$ in the case of
proteins). However, it should be noted that often in a sequence translation problem
input variables are all observed (this is the case, at least, in the protein problem) 
and thus we know a priori that the nodes Ut always receive evidence, both in the 
learning and recall phases. Therefore, we can reduce the complexity by a factor 
K since only those entries which are known to be non-zero need to be stored and 
used in the absorption computations. In the case of proteins, this simple trick 
yields a speed up factor of about 20, the size of the input amino acid alphabet. 
The advantage is even more pronounced if, instead of a single amino acid, the 
input Ut is obtained by taking a window of amino acids, as explained later on. 
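The size of explicit probability tables and the saving obtained from observed inputs can be made concrete with a little arithmetic. The sketch below uses hypothetical sizes ($n = m = 10$ states, 3 output classes, sequences of 100 residues) chosen only for illustration; the cost function counts clique table entries and ignores constant factors.

```python
def cpt_entries(n, m, K, s):
    """Entries in the three conditional probability tables of a discrete BIOHMM.
    n, m: forward/backward states; K: input symbols; s: output symbols."""
    forward = K * n * n        # P(F_t | F_{t-1}, U_t)
    backward = K * m * m       # P(B_t | B_{t+1}, U_t)
    output = K * n * m * s     # P(Y_t | F_t, B_t, U_t)
    return forward + backward + output

def clique_cost(n, K, T, inputs_observed=True):
    """Order-of-magnitude propagation cost with n = m: each slice has cliques
    over three n-valued state variables, times K if U_t is not clamped."""
    per_slice = n ** 3 * (1 if inputs_observed else K)
    return T * per_slice

print(cpt_entries(10, 10, 20, 3))
print(clique_cost(10, 20, 100, inputs_observed=False) // clique_cost(10, 20, 100))
```

With amino acid inputs the ratio printed on the last line is the alphabet size, matching the speed-up factor discussed above.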

Learning is formulated in the framework of maximum likelihood and is solved
by the expectation maximization (EM) algorithm. EM is a family of algorithms
for maximum likelihood estimation in the presence of missing (or hidden)
variables (in our case, forward and backward states $F_t$ and $B_t$ are the hidden
variables). Let $\mathcal{D}_c$ denote the complete data (input and output sequences $U$ and
$Y$, plus the hidden state sequences $F$ and $B$) and let $\mathcal{D}$ denote the incomplete
data (only the input and output sequences). Furthermore, let $L_c$ denote the
complete data likelihood, i.e.

$$L_c(\theta; \mathcal{D}_c) = \prod_{\text{training sequences}} P(Y, F, B \mid U, \theta).$$

Since forward and backward states are not observed, $\log L_c(\theta; \mathcal{D}_c)$ is a random
variable that cannot be optimized directly. However, given an initial hypothesis
$\hat{\theta}$ on the parameters and observed variables $U$ and $Y$, it is possible to compute
the expected value of $\log L_c(\theta; \mathcal{D}_c)$. Thus, an EM algorithm iteratively fills in
missing variables according to the following procedure:

E-step Compute the auxiliary function

$$Q(\theta, \hat{\theta}) = E[\log L_c(\theta; \mathcal{D}_c) \mid \hat{\theta}, \mathcal{D}]$$

M-step Update the parameters by maximizing $Q(\theta, \hat{\theta})$ with respect to $\theta$, i.e.

$$\hat{\theta} \leftarrow \arg\max_{\theta} Q(\theta, \hat{\theta})$$

and repeat until convergence.







A well known result is that the above procedure will converge to a local
maximum of the likelihood function (Dempster et al., 1977). A variant of the above
procedure, called generalized EM (GEM), consists of computing a new parameter
vector $\theta$ such that $Q(\theta, \hat{\theta}) > Q(\hat{\theta}, \hat{\theta})$ (instead of performing full optimization).
GEM will also converge to a local maximum, although the convergence rate may be
slower.
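The GEM condition is enough because of a standard bound from the EM literature, sketched here in the notation of this section. The symbol $L$ for the incomplete data likelihood is introduced only for this illustration. By Jensen's inequality, for any $\theta$,

$$\log L(\theta; \mathcal{D}) - \log L(\hat{\theta}; \mathcal{D}) \;\ge\; Q(\theta, \hat{\theta}) - Q(\hat{\theta}, \hat{\theta}),$$

so any update with $Q(\theta, \hat{\theta}) > Q(\hat{\theta}, \hat{\theta})$ strictly increases the likelihood, even if $Q$ is not fully maximized.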

In the case of belief networks, the E-step essentially consists of computing 
the expected sufficient statistics for the parameters. In our case these statistics 
are the following expected counts: 

- $N^f_{j,\ell,u}$: expected number of (forward) transitions from $f^\ell$ to $f^j$ when the
input is $u$ ($j, \ell = 1, \ldots, n$; $u = 1, \ldots, K$);

- $N^b_{k,\ell,u}$: expected number of (backward) transitions from $b^\ell$ to $b^k$ when the
input is $u$ ($k, \ell = 1, \ldots, m$; $u = 1, \ldots, K$);

- $N^y_{i,j,k,u}$: expected number of times output symbol $i$ is emitted at a given
position $t$ when the forward state at $t$ is $f^j$, the backward state at $t$ is $b^k$, and
the input at $t$ is $u$.

Basically, expected sufficient statistics are computed by inference using the
junction tree algorithm as a subroutine. If local conditional probabilities were
modeled by multinomial tables, then the M-step would be straightforward: each
entry in the table would be replaced by the corresponding normalized expected
count (Heckerman, 1997). However, in our case the M-step deserves more
attention because of the neural network reparameterization of the local
conditional probabilities. In fact, maximizing the function $Q(\theta, \hat{\theta})$ requires the neural
network weights $\theta$ to be adapted to perfectly fit the normalized expected
sufficient statistics. Even in the absence of local minima, a complete maximization
would require an expensive inner gradient descent loop, inside the outer EM
loop. Hence, we resorted to a generalized EM algorithm, where a single
gradient descent step is performed inside the main loop. The expected sufficient
statistics are used as “soft” targets for training the neural networks. In
particular, for each output unit, the backpropagation delta-error term is obtained
as the difference between the unit activation (before the softmax) and the
corresponding expected sufficient statistic. For example, consider the network for
estimating the conditional probability of the output $Y_t$, given the forward state
$F_t$, the backward state $B_t$, and the input symbol $U_t$. For each sequence and
for each time step $t$, let $a_{i,j,k,t}$ be the activation of the $i$-th output unit of this
network when fed by $F_t = f^j$, $B_t = b^k$ and $U_t = u_t$ (where $u_t$ is fixed
according to the input training sequence). Let $z_{i,j,k,t} = \exp(a_{i,j,k,t}) / \sum_{\ell} \exp(a_{\ell,j,k,t})$.
We have $z_{i,j,k,t} = P(Y_t = y^i \mid F_t = f^j, B_t = b^k, U_t = u_t, \theta)$. Moreover, let
$\hat{z}_{i,j,k,t} = P(Y_t = y^i \mid F_t = f^j, B_t = b^k, U_t = u_t, \text{training data}, \hat{\theta})$ denote the
contribution in this sequence at position $t$ to the expected sufficient statistics
(obviously, $\hat{z}_{i,j,k,t} = 0$ if the observed output at position $t$, $y_t \neq y^i$). Then, the
error function for training this network is given by

$$C = \sum_{\text{training sequences}} \sum_{t} \sum_{i,j,k} \hat{z}_{i,j,k,t} \log z_{i,j,k,t} \tag{2}$$

Similar equations hold for the other two networks modeling $P(F_t \mid F_{t-1}, U_t)$ and
$P(B_t \mid B_{t+1}, U_t)$.

4 Bidirectional Recurrent Neural Nets 

4.1 The Architecture 

The basic idea underlying the architecture of BIOHMMs can be adapted to
recurrent neural networks. Suppose in this case $F_t$ and $B_t$ are two vectors in $\mathbb{R}^n$
and $\mathbb{R}^m$, respectively. Then consider the following (deterministic) dynamics, in
vector notation:

$$F_t = \phi(F_{t-1}, U_t) \tag{3}$$

$$B_t = \beta(B_{t+1}, U_t) \tag{4}$$

where $\phi()$ and $\beta()$ are adaptive nonlinear transition functions. The vector
$U_t \in \mathbb{R}^K$ encodes the input at time $t$ (for example, using one-hot encoding in the
case of amino acids). Equations 3 and 4 are completed by the two boundary
conditions $F_0 = B_{T+1} = 0$. Transition functions $\phi()$ and $\beta()$ are realized by two
MLPs $\mathcal{N}_\phi$ and $\mathcal{N}_\beta$, respectively. In particular, for MLP $\mathcal{N}_\beta$ we have:

$$b_{i,t} = \sigma\left( \sum_{j=1}^{N_\beta} \omega_{i,j} h_{j,t} \right), \quad i = 1, \ldots, m \tag{5}$$

where $\sigma$ is the logistic function, $N_\beta$ is the number of hidden units in MLP $\mathcal{N}_\beta$,
$b_{i,t}$ is the $i$-th component of $B_t$ and

$$h_{j,t} = \sigma\left( \sum_{\ell=1}^{m} w_{j,\ell}\, b_{\ell,t+1} + \sum_{\ell=1}^{K} v_{j,\ell}\, u_{\ell,t} \right), \quad j = 1, \ldots, N_\beta \tag{6}$$

is the output of the $j$-th hidden unit. In the above equations, $\omega_{i,j}$, $w_{j,\ell}$ and $v_{j,\ell}$ are
adaptive weights. Network $\mathcal{N}_\phi$ is described by similar equations except that the
forward state $F_t$ is used instead of the backward state $B_t$.
Also, consider the mapping

$$Y_t = \eta(F_t, B_t, U_t) \tag{7}$$

where $Y_t \in \mathbb{R}^s$ is the output prediction and $\eta()$ is realized by a third MLP $\mathcal{N}_\eta$.
In the case of classification, $s$ is the number of classes and $\mathcal{N}_\eta$ has a softmax
output layer so that outputs can be interpreted as conditional class probabilities.
The neural network architecture resulting from eqs. 3, 4, and 7 is shown in Fig.
3, where for simplicity all the MLPs have a single hidden layer (several variants
are conceivable by varying the number and the location of the hidden layers).
Like in Elman’s simple recurrent networks, the hidden state $F_t$ is copied back
to the input. This is graphically represented in Fig. 3 using the causal shift
operator $q^{-1}$ that operates on a generic temporal variable $X_t$ and is symbolically
defined as $X_{t-1} = q^{-1} X_t$. The shift operator with the composition operation
forms a multiplicative group. In particular, $q$, the inverse (or non-causal) shift
operator, is defined by $X_{t+1} = q X_t$ and $q^{-1} q = 1$. As shown in Fig. 3, a non-causal
copy is performed on the hidden state $B_t$. Clearly, if we remove the backward
chain $\{B_t\}$ we obtain a standard first-order RNN. A BRNN is stationary if the
connection weights in networks $\mathcal{N}_\beta$, $\mathcal{N}_\phi$, and $\mathcal{N}_\eta$ do not change over time.
Stationarity will be assumed throughout the paper. It is worth noting that using MLPs
for implementing $\beta()$ and $\phi()$ is just one of the available options, motivated by
their well known universal approximation properties. Similar generalizations of
second-order RNNs (Giles et al., 1992) or recurrent radial basis functions
(Frasconi et al., 1996) are easily conceivable following the approach described here.

Fig. 3. A bidirectional RNN.

4.2 Inference and Learning 

As for standard RNNs, it is convenient to describe inference and learning in 
BRNNs by unrolling the network on the input sequence. The resulting graphi- 
cal model has exactly the same form as the BIOHMM network shown in Fig. 
2. Actually, the BRNN can be interpreted as a Bayesian network, except for 
some differences as explained below. First, causal relationships among nodes 
linked by a directed edge should be regarded as deterministic rather than
probabilistic. In particular, $P(Y_t \mid F_t, B_t, U_t)$ should be regarded as a Dirac delta
distribution centered on a value corresponding to $Y_t = \eta(F_t, B_t, U_t)$. Similarly,
$P(F_t \mid F_{t-1}, U_t)$ and $P(B_t \mid B_{t+1}, U_t)$ are replaced by Dirac distributions centered
at $F_t = \phi(F_{t-1}, U_t)$ and $B_t = \beta(B_{t+1}, U_t)$, respectively. The second difference is
that state variables in this case are vectors of real variables rather than symbols 
in a finite alphabet. 

The inference algorithm in BRNNs is straightforward. Starting from $F_0 = 0$,
all the states $F_t$ are updated from left to right, following eq. 3. Similarly,
states $B_t$ are updated from right to left using the boundary condition $B_{T+1} = 0$
and following eq. 4. After forward and backward propagation have taken place,
predictions $Y_t$ can be computed using eq. 7. The main advantage of BRNNs
(compared to BIOHMMs) is that inference is much more efficient. The intuitive
reason is that in the case of BRNNs, the hidden states $F_t$ and $B_t$ evolve
independently (without affecting each other). However, in the case of BIOHMMs, $F_t$
and $B_t$, although conditionally independent given $U_t$, become dependent when
$Y_t$ is also given (as happens during learning). This is reflected by the fact
that cliques relative to $B_t$ and $F_t$ in the junction tree contain triplets of state
variables, thus yielding a time complexity proportional to $n^3$ for each time step
(if both variables have the same number of states $n$). In the case of BRNNs,
assuming that MLPs $\mathcal{N}_\phi$ and $\mathcal{N}_\beta$ have $O(n)$ hidden units, time complexity is
only proportional to $n^2$ for each time step and this can be further reduced by
limiting the number of hidden units.
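The inference procedure just described can be sketched in a few lines of NumPy. This is an illustrative toy, not the system evaluated in this chapter: the layer sizes, the random initialization, the bias terms, and the helper names `mlp`, `init_mlp`, and `brnn_forward` are all assumptions made for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlp(params, x, softmax=False):
    """Single-hidden-layer MLP with params = (W1, b1, W2, b2)."""
    W1, b1, W2, b2 = params
    a = W2 @ sigmoid(W1 @ x + b1) + b2
    if softmax:
        e = np.exp(a - a.max())
        return e / e.sum()
    return sigmoid(a)

def init_mlp(rng, n_in, n_hidden, n_out, scale=0.1):
    """Random weights and zero biases; sizes and scale are illustrative."""
    return (rng.normal(0, scale, (n_hidden, n_in)), np.zeros(n_hidden),
            rng.normal(0, scale, (n_out, n_hidden)), np.zeros(n_out))

def brnn_forward(U, phi, beta, eta, n, m):
    """U: (T, K) encoded inputs. Returns (T, s) class probabilities."""
    T = U.shape[0]
    F = np.zeros((T + 2, n))   # F[0] holds the boundary state F_0 = 0
    B = np.zeros((T + 2, m))   # B[T + 1] holds the boundary state B_{T+1} = 0
    for t in range(1, T + 1):          # left to right, eq. 3
        F[t] = mlp(phi, np.concatenate([F[t - 1], U[t - 1]]))
    for t in range(T, 0, -1):          # right to left, eq. 4
        B[t] = mlp(beta, np.concatenate([B[t + 1], U[t - 1]]))
    return np.stack([                  # output mapping, eq. 7
        mlp(eta, np.concatenate([F[t], B[t], U[t - 1]]), softmax=True)
        for t in range(1, T + 1)])
```

The two state loops never read each other's results, which is the independence property that makes BRNN inference cheaper than junction tree propagation in a BIOHMM.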

The learning algorithm is based on maximum-likelihood. For simplicity we
limit our discussion to classification. In this case, the cost function (or negative
log likelihood) has the form of a cross-entropy:

$$C(\mathcal{D}; \theta) = \sum_{\text{training sequences}} \sum_{t=1}^{T} \sum_{i=1}^{s} \bar{y}_{i,t} \log y_{i,t}$$

where $\bar{y}_{i,t}$ is the target output (equal to 1 if the class at position $t$ is $i$, and
equal to 0 otherwise) and $y_{i,t}$ is the $i$-th output of $\mathcal{N}_\eta$ at position $t$. Optimization
is based on gradient descent, where gradients are computed by a noncausal
version of backpropagation through time. The intuitive idea behind
backpropagation through time is to flatten out cycles in the recurrent network by unrolling
the recursive computation of state variables over time. This essentially consists
of replicating the transition network for each time step, so that a feedforward
network with shared weights is obtained. The same unrolling procedure can be
applied in more general cases than just sequences. For example it can be
applied to trees and graphs, provided that the resulting unrolled network is acyclic
(Goller & Kuechler, 1996; Frasconi et al., 1998). In the case of BRNNs, it is
immediate to recognize that the unrolling procedure yields an acyclic graph (if
we only look at the main variables $Y_t$, $B_t$, $F_t$, and $U_t$, the unrolled network has
the same topology as the Bayesian network shown in Figure 2). The unrolled
network can be divided into slices associated with different position indices $t$,
and for each slice there is exactly one replica of each network $\mathcal{N}_\eta$, $\mathcal{N}_\phi$, and $\mathcal{N}_\beta$.
The error signal is first computed for the leaf nodes (corresponding to the output
variables $Y_t$):

$$\delta_{i,t} = \frac{\partial C}{\partial a_{i,t}} = \bar{y}_{i,t} - y_{i,t}, \quad i = 1, \ldots, s, \quad t = 1, \ldots, T$$

where $a_{i,t}$ denotes the activation of the $i$-th output unit of $\mathcal{N}_\eta$ (before the
softmax). The error is then propagated over time (in both directions) by following
any reverse topological sort of the unrolled net. For example, the delta-errors
of $\mathcal{N}_\phi$ at position $t$ are computed from the delta-errors at the input of $\mathcal{N}_\eta$ at $t$,
and from the delta-errors at the first $n$ inputs of $\mathcal{N}_\phi$ at position $t + 1$. Similarly,
the delta-errors of $\mathcal{N}_\beta$ at position $t$ are computed from the delta-errors at the
input of $\mathcal{N}_\eta$ at $t$, and from the delta-errors at the last $m$ inputs of $\mathcal{N}_\beta$ at position
$t - 1$. Obviously, the computation of delta-errors also involves backpropagation
through the hidden layers of the MLPs. Since the model is stationary, weights are
shared among the different replicas of the MLPs at different time steps. Hence,
the total gradients are obtained by summing all the contributions associated with
different time steps. There are three possibilities. First, if $i$ and $j$ are any pair of
connected neurons within the same position slice $t$, $\delta_{i,t}$ denotes the delta-error
of unit $i$ at $t$, and $x_{j,t}$ denotes the output of neuron $j$ at $t$, then



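$$\frac{\partial C}{\partial w_{ij}} = \sum_{t=1}^{T} \delta_{i,t}\, x_{j,t}.$$

Second, if i is in the hidden layer of $N_\beta$ at slice t and j is in the output layer of
$N_\beta$ at slice t + 1, then

$$\frac{\partial C}{\partial w_{ij}} = \sum_{t=1}^{T} \delta_{i,t}\, x_{j,t+1}.$$

Finally, if i is in the hidden layer of $N_\phi$ at slice t and j is in the output layer of
$N_\phi$ at slice t - 1, then

$$\frac{\partial C}{\partial w_{ij}} = \sum_{t=1}^{T} \delta_{i,t}\, x_{j,t-1}.$$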
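The three cases share one pattern: because the weights are shared across slices, every slice contributes a product of a delta-error and a presynaptic output, and only the slice of the presynaptic unit changes (same slice, next slice, or previous slice). The following is a minimal sketch in plain Python, assuming the delta-errors and unit outputs have already been computed and stored per position. The function name `shared_weight_gradient`, the list layout, and the treatment of positions outside the sequence as zero states are illustrative assumptions, not part of the original text.

```python
def shared_weight_gradient(delta, x, offset=0):
    """Accumulate dC/dw[i][j] = sum over t of delta[t][i] * x[t + offset][j].

    delta[t][i]: delta-error of postsynaptic unit i at position t (0-based).
    x[t][j]:     output of presynaptic unit j at position t.
    offset:      0 for connections inside one slice, +1 when unit j sits in
                 slice t+1 (backward chain), -1 when it sits in slice t-1
                 (forward chain).
    Positions outside the sequence are skipped, which amounts to assuming
    zero boundary states (an assumption of this sketch).
    """
    T = len(delta)
    n_post, n_pre = len(delta[0]), len(x[0])
    grad = [[0.0] * n_pre for _ in range(n_post)]
    for t in range(T):
        s = t + offset
        if not 0 <= s < T:
            continue
        for i in range(n_post):
            for j in range(n_pre):
                grad[i][j] += delta[t][i] * x[s][j]
    return grad
```

Calling the function three times, with `offset` set to 0, +1, and -1, gives the three sums in turn.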



4.3 Embedded Memories and Other Architectural Variants 

One of the principal difficulties when training standard RNNs is the problem of 
vanishing gradients. In (Bengio et al., 1994), it is shown that one of the following
two undesirable situations necessarily arises: either the system is unable to
robustly store past information about its inputs, or gradients vanish exponentially.
Intuitively, in order to contribute to the output at time t, the input signal at
time $t - \tau$ must be propagated in the forward chain through $\tau$ replicas of the NN
that implements the state transition function. However, during gradient
computation, error signals must be propagated backward along the same path. Each
propagation step involves a multiplication between the error vector and the
Jacobian matrix associated with the transition function. Unfortunately, when the
dynamics develop attractors that allow the system to reliably store past
information, the norm of the Jacobian is less than 1. Hence, when $\tau$ is large, gradients of the
error at time t with respect to inputs at time $t - \tau$ tend to vanish exponentially.
Similarly, in the case of BRNNs, error propagation in both the forward and the 
backward chains is subject to exponential decay. Thus, although the model has 
in principle the capability of storing remote information, such information can- 
not be learnt effectively. Clearly, this is a theoretical argument and its practical 
impact needs to be evaluated on a per case basis. 
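The decay argument can be illustrated numerically. The sketch below repeatedly multiplies an error vector by the transpose of a Jacobian whose norm is below 1 and records the Euclidean norm after each backward step. Using one fixed 2x2 Jacobian for every step is a simplifying assumption made for illustration (in a trained network the Jacobian changes from step to step), and the matrix values and the name `backpropagated_norms` are hypothetical.

```python
import math

def backpropagated_norms(jacobian, error, steps):
    """Return the Euclidean norm of the error after 0..steps backward steps.

    Each backward step multiplies the error vector by the transposed Jacobian.
    """
    n = len(error)
    norms = [math.sqrt(sum(e * e for e in error))]
    for _ in range(steps):
        error = [sum(jacobian[i][j] * error[i] for i in range(n))
                 for j in range(n)]
        norms.append(math.sqrt(sum(e * e for e in error)))
    return norms

# Hypothetical contracting Jacobian (its spectral norm is below 1).
J = [[0.5, 0.2], [0.1, 0.6]]
norms = backpropagated_norms(J, [1.0, 1.0], 20)
```

With this matrix the recorded norms shrink at every step, so an error injected 20 steps away contributes only a small fraction of its original magnitude.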

In practice, in the case of proteins, the BRNN can reliably utilize input in- 
formation located within about ±15 amino acids (i.e., the total effective window 
size is about 31). This was empirically evaluated by feeding the model with 
increasingly long protein fragments. We observed that the average predictions 
at the central residues did not significantly change if fragments were extended 
beyond 41 amino acids. This is an improvement over standard NNs with input 
window sizes ranging from 11 to 17 amino acids (Rost & Sander, 1994; Riis & 
Krogh, 1996). Yet, there is presumably relevant information located at longer 
distances that our model has not been able to discover so far.

To limit this problem, we propose a remedy motivated by recent studies of 
NARX networks (Lin et al., 1996). In these networks, the vanishing gradients
problem is mitigated by the use of an explicit delay line applied to the output, 
which provides shorter paths for the effective propagation of error signals. The 
very same idea cannot be applied directly to BRNNs since output feedback, 
combined with bidirectional propagation, would generate cycles in the unrolled 
network. However, as suggested in (Lin et al., 1998), a similar mechanism can be
implemented by inserting multiple delays in the connections among hidden state 
units rather than output units. The modified dynamics in the case of BRNNs 
are defined as follows: 

$$F_t = \phi(F_{t-1}, F_{t-2}, \ldots, F_{t-s}, I_t)$$

$$B_t = \beta(B_{t+1}, B_{t+2}, \ldots, B_{t+s}, I_t). \qquad (8)$$

The explicit dependence on forward or backward states introduces shortcut 
connections in the graphical model, forming shorter paths along which gradients 
can be propagated. This is akin to introducing higher order Markov chains in 
the probabilistic version. However, unlike Markov chains where the number of 
parameters would grow exponentially with s, in the present case the number 
of parameters grows only linearly with s. To reduce the number of parameters, 
a simplified version of Eq. 8 limits the dependencies to state vectors located s 
residues apart from t: 

$$F_t = \phi(F_{t-1}, F_{t-s}, I_t)$$

$$B_t = \beta(B_{t+1}, B_{t+s}, I_t). \qquad (9)$$
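To make the simplified dynamics concrete, the following sketch runs the forward chain left to right and the backward chain right to left, each state depending on its immediate neighbour and on the state s positions away. The transition functions are passed in as plain callables. The name `run_chains`, the dictionary storage, and the use of zero vectors for states outside positions 1..T are assumptions of this sketch, not details taken from the text.

```python
def run_chains(inputs, phi, beta, n, m, s):
    """Compute F[1..T] and B[1..T] with a one-step and an s-step delay.

    inputs[t-1] is the input at position t; phi and beta are callables
    taking (neighbour_state, distant_state, input) and returning a state.
    States outside 1..T are taken to be zero vectors (an assumption).
    """
    T = len(inputs)
    zero_f, zero_b = [0.0] * n, [0.0] * m
    F, B = {}, {}
    for t in range(1, T + 1):            # forward chain, left to right
        F[t] = phi(F.get(t - 1, zero_f), F.get(t - s, zero_f), inputs[t - 1])
    for t in range(T, 0, -1):            # backward chain, right to left
        B[t] = beta(B.get(t + 1, zero_b), B.get(t + s, zero_b), inputs[t - 1])
    return F, B

# Toy linear transition functions with scalar states, for illustration only.
toy = lambda near, far, u: [0.5 * near[0] + 0.25 * far[0] + u]
F, B = run_chains([1.0, 1.0, 1.0, 1.0], toy, toy, n=1, m=1, s=2)
```

In an MLP implementation of the transition functions, each extra delayed state adds one more block of input units, which is why the number of weights grows linearly with s.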

Another variant of the basic architecture, which also increases the effective
window size, consists in feeding the output network with a window in the
forward and backward state chains. In this case, the prediction is computed as

$$Y_t = \eta(F_{t-k}, \ldots, F_{t+k}, B_{t-k}, \ldots, B_{t+k}, I_t). \qquad (10)$$
5 Datasets

The assignment of the SS categories to the experimentally determined 3D struc- 
ture is nontrivial and is usually performed by the widely used DSSP program 
(Kabsch & Sander, 1983). DSSP works by assigning potential backbone hydrogen 
bonds (based on the 3D coordinates of the backbone atoms) and subsequently 
by identifying repetitive bonding patterns. Two alternatives to this assignment 
scheme are the programs STRIDE and DEFINE. In addition to hydrogen bonds, 
STRIDE uses also dihedral angles (Frishman & Argos, 1995). DEFINE uses dif- 
ference distance matrices for evaluating the match of interatomic distances in 
the protein to those from idealized SS (Richards & Kundrot, 1988). While assignment
methods impact prediction performance to some extent (Cuff & Barton,
1999), here we concentrate exclusively on the DSSP assignments. A number of 
data sets were used to develop and test our algorithms. We will refer to each set 
using the number of sequences contained in it. The first high quality data used 
in this study was extracted from the Brookhaven Protein Data Bank (PDB) 
(Bernstein et al., 1977) release 77 and subsequently updated. We excluded
entries if: 

- They were not determined by X-ray diffraction, since no commonly used
  measure of quality is available for NMR or theoretical model structures.

- The program DSSP could not produce an output, since we wanted to use the
  DSSP assignment of protein secondary structure (Kabsch & Sander, 1983).

- The protein had physical chain breaks (defined as neighboring amino acids
  in the sequence having Cα-distances exceeding 4.0 Å).

- They had a resolution worse than 1.9 Å, since resolutions better than this
  enable the crystallographer to remove most errors from their models.

- The chain had a length of less than 30 amino acids.

From the remaining chains, a representative subset with low pairwise sequence
similarities was selected by running algorithm #1 of Hobohm et al. (1992),
using the local alignment procedure search (rigorous Smith-Waterman
algorithm) (Myers & Miller, 1988; Pearson, 1990) with the pam120 matrix and
gap penalties -12, -4.
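The exclusion criteria and the redundancy-reduction step can be read as a filter followed by a greedy selection. The sketch below is only an illustration under stated assumptions: the dictionary keys (`method`, `dssp_ok`, `max_ca_gap`, `resolution`, `length`) are hypothetical field names, and `select_representatives` is a generic greedy pass written in the spirit of the selection step, with the ordering of chains and the similarity test left to the caller because the text does not specify them here.

```python
def passes_filters(entry):
    """Apply the exclusion criteria to one chain described by a dict.

    All keys are hypothetical names chosen for this sketch.
    """
    return (entry["method"] == "X-ray"        # X-ray structures only
            and entry["dssp_ok"]              # DSSP produced an output
            and entry["max_ca_gap"] <= 4.0    # no physical chain break (in Angstrom)
            and entry["resolution"] <= 1.9    # resolution not worse than 1.9 Angstrom
            and entry["length"] >= 30)        # at least 30 amino acids

def select_representatives(chains, similar):
    """Greedy selection: keep a chain only if it is not similar to any kept one.

    `chains` is assumed to be given in priority order, and `similar` is a
    caller-supplied predicate (for example, an alignment score threshold).
    """
    kept = []
    for chain in chains:
        if not any(similar(chain, k) for k in kept):
            kept.append(chain)
    return kept
```

In practice the `similar` predicate would wrap a pairwise alignment, so the cost of the selection is dominated by the number of alignments performed.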

Thus we obtained a data set consisting of 464 distinct protein chains, correspon- 
ding to 123,752 amino acids, roughly 10 times more than what was available 
in (Qian & Sejnowski, 1988). Another set we used is the EMBL non-redundant 
PDB subsets that can be accessed by ftp at the site ftp.embl-heidelberg.de. Data 
details are in the file /pub/databases/pdb_select/README. The extraction is 
based on the file /pub/databases/pdb_select/1998_june.25.gz containing a set 
of non-redundant (25%) PDB chains. After removing 74 chains on which the 
DSSP program crashes, we obtained another set of 824 sequences, overlapping 
in part with the former ones. In addition, we also used the original set of 126 
sequences of Rost and Sander (corresponding to 23,348 amino acid positions) as 
well as the complementary set of 396 non-homologue sequences (62,189 amino 
acids) prepared by Cuff and Barton (1999). Both sets can be downloaded at
http://circinus.ebi.ac.uk:8081/pred_res/. Finally, we also constructed two more
data sets, containing all proteins in PDB which are at least 30 amino acids long,
produce DSSP output without chain breaks, and have a resolution of at least
2.5 Å. Furthermore the proteins in both sets have less than 25% identity to
any of the 126 sequences of Rost and Sander. In both sets, internal homology
is reduced again by Hobohm’s algorithm, keeping the PDB sequences with
the best resolution. For one set, we use the standard 25% threshold curve for
homology reduction. For the other set, however, we raise the threshold curve
by 25%. The set with 25% homology threshold contains 826 sequences,
corresponding to a total of 193,249 amino acid positions, while the set with 50%
homology threshold contains 1180 sequences (282,303 amino acids). Thus, to
the best of our knowledge, our experiments are based on the currently largest
available corpora of non-redundant data. In all but one experiment (see below),
profiles were obtained from the HSSP database (Schneider et al., 1997) available
at http://www.sander.embl-heidelberg.de/hssp/.

Fig. 4. Unfolded bidirectional model for protein SS prediction. The model receives as
input an explicit window of amino acids of size w (w = 3 in the figure).



6 Architecture Details and Experimental Results 

We carried out several preliminary experiments to tune up and evaluate the
prediction system. DSSP classes were assigned to three secondary structure classes
α, β, and γ as follows: α is formed by DSSP class H, β by E, and γ by everything
else (including DSSP classes G, S, T, B, and I). This assignment is slightly
different from other assignments reported in the literature. For example, in (Riis
& Krogh, 1996), α contains DSSP classes H, G, and I. In the CASP competition
(Moult et al., 1997; CASP3, 1998), α contains H and G, while β contains
E and B. In a first set of experiments, we used the 824 sequences dataset and
reserved 2/3 of the available data for training, and 1/3 for testing. We trained
several BRNNs of different sizes and different architectural details. In all
experiments, we set n = m and we tried different values for n and k (see Eq. 10).
The number of free parameters varied from about 1400 to 2600. Qualitatively
we observed that using k > 0 can improve accuracy, but increasing n beyond 12
does not help because of overfitting. Results for this method, without using
profiles, are summarized in the first rows of Table 1. By comparison, we also trained
several feedforward NNs on the same data. The best feedforward NN achieved
Q3 = 67.2% accuracy using a window of 13 amino acids. By enriching the
feedforward architecture with adaptive input encoding and output filtering, as in
(Riis & Krogh, 1996), 68.5% accuracy was achieved (output filtering actually
increases the length of the input window). Hence, the best BRNN outperforms
our best feedforward network, even when additional architectural design is
included. Subsequent experiments included the use of profiles. Table 1 reports the
best results obtained by using multiple alignments, both at the input and
output levels. Profiles at the input level consistently yielded better results. The best
feedforward networks trained in the same conditions achieve Q3 = 73.0% and
72.3%, respectively.
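The two class assignments discussed in this section differ only in which DSSP codes are merged into helix and strand. A minimal sketch follows, using the letters H, E, and C for the classes α, β, and γ. The table names, the function name, and the use of "-" as a stand-in for DSSP's unlabelled state are illustrative assumptions.

```python
# Default assignment used in most experiments: only H is helix, only E is strand.
DEFAULT_ASSIGNMENT = {"H": "H", "E": "E"}

# CASP-style assignment: G is merged into helix and B into strand.
CASP_ASSIGNMENT = {"H": "H", "G": "H", "E": "E", "B": "E"}

def to_three_classes(dssp_string, table):
    """Map a string of DSSP codes to three classes; unlisted codes become coil (C)."""
    return "".join(table.get(code, "C") for code in dssp_string)
```

For example, the DSSP string "HGIEBTS-" maps to "HCCECCCC" under the default table and to "HHCEECCC" under the CASP-style table, which shows why accuracies obtained under the two assignments are not directly comparable.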



Table 1. Experimental results using a single BRNN and 1/3 of the data as test set.
$h_\phi$, $h_\beta$ and $h_\eta$ are the number of hidden units for the transition networks $N_\phi$, $N_\beta$ and
the output network $N_\eta$ respectively. We always set $h_\phi = h_\beta$.

Profiles   n    k    $h_\phi$   $h_\eta$   W      Accuracy
No         7    2    8          11         1611   Q3 = 68.7%
No         9    2    8          11         1899   Q3 = 68.8%
No         7    3    8          11         1919   Q3 = 68.6%
No         8    3    9          11         2181   Q3 = 68.8%
No         20   0    17         11         2601   Q3 = 67.6%
Output     9    2    8          11         1899   Q3 = 72.6%
Output     8    3    9          11         2181   Q3 = 72.7%
Input      9    2    8          11         1899   Q3 = 73.3%
Input      8    3    9          11         2181   Q3 = 73.4%
Input      12   3    9          11         2565   Q3 = 73.6%



In a second set of experiments (also based on the 824 sequences), we
combined several BRNNs to form an ensemble, as in (Krogh & Vedelsby, 1995), using a
simple averaging scheme. Different networks were obtained by varying
architectural details such as n, k, and the number of hidden units. Combining 6 networks
using profiles at the input level we obtained the best accuracy Q3 = 75.1%,
measured in this case using 7-fold cross validation. We also tried to include in the
ensemble a set of 4 BRNNs using profiles at the output level but performance in
this way slightly decreased to 75.0%. A study for assessing the capabilities of the
model in capturing long-range information was also performed. Results indicate
that the model is sensitive to information located within about ±15 amino acids.
Although this value is not very high, it should be remarked that typical
feedforward nets reported in the literature do not exploit information beyond $\tau = 8$.
To further explore the long-range information problem we conducted another
set of experiments using BRNNs with simplified embedded memories (see Eq.
9). In this case, as for the results reported in Table 1, we used a single model
(rather than a mixture) and the test set method (1/3 of the available data) for
measuring accuracy. We tried all values of s from 1 to 10, but in no case could we
observe a significant performance improvement on the test set. Interestingly, our
experiments showed that using shortcuts reduces the convergence difficulties
associated with vanishing gradients: accuracy on the training set increased from
75.7% using no shortcuts to 76.9% with s = 3. On the other hand, the gap
between training set and test set performance also increased. Thus overfitting
offset the convergence improvement, probably because long-range information is
too sparse and noisy.
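The simple averaging scheme used for the ensemble can be sketched as follows: the class-probability vectors produced by the member networks at each position are averaged, and the class with the largest averaged probability is predicted. The function name and the nested-list layout are assumptions of this sketch, and the numbers in the usage lines are made up for illustration.

```python
def ensemble_predict(member_outputs):
    """Average member predictions position by position.

    member_outputs[m][t] is the class-probability vector of member m at
    position t. Returns the averaged distributions and the arg-max classes.
    """
    n_members = len(member_outputs)
    length = len(member_outputs[0])
    n_classes = len(member_outputs[0][0])
    averaged = [[sum(member_outputs[m][t][c] for m in range(n_members)) / n_members
                 for c in range(n_classes)]
                for t in range(length)]
    labels = [max(range(n_classes), key=p.__getitem__) for p in averaged]
    return averaged, labels

# Two hypothetical members, two positions, three classes.
member_a = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
member_b = [[0.2, 0.3, 0.5], [0.1, 0.2, 0.7]]
averaged, labels = ensemble_predict([member_a, member_b])
```

At the second position the two members disagree, and the averaged distribution follows the more confident one.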

Another experiment was conducted by training on all the 824 sequences and
using the official test sequences used at the 1998 CASP3 competition. In this
case, we adopted a slightly different class assignment for training (DSSP classes
H, G, and I were merged together). The CASP3 competition was won by one
of the two programs entered by D. Jones, which selected 23 out of 35 proteins
obtaining a performance of Q3 = 77.6% per protein, or Q3 = 75.5% per
residue (Jones, 1999). We evaluated that system on the whole set of 35 proteins by
using Jones’ prediction server at http://137.205.156.147/psiform.html. It
achieved Q3 = 74.3% per residue and 76.2% per protein. On the same 35 sequences
our system achieved Q3 = 73.0% per residue and Q3 = 74.6% per protein. A
test set of 35 proteins is relatively small for drawing general conclusions. Still,
we believe that this result confirms the effectiveness of the proposed model,
especially in consideration of the fact that Jones’ system builds upon more recent
profiles from the TrEMBL database (Bairoch & Apweiler, 1999). These profiles
contain many more sequences than our profiles, which are based on the older HSSP
database, leaving room for further improvements of our system.
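The per-residue and per-protein figures quoted here are two different averages of the same predictions. A minimal sketch of both, assuming predictions and DSSP-derived labels are given as equal-length strings over three class letters (the function names and the toy data are illustrative):

```python
def q3_per_residue(pairs):
    """Fraction of correctly predicted residues, pooled over all proteins."""
    correct = sum(p == o for pred, obs in pairs for p, o in zip(pred, obs))
    total = sum(len(obs) for _, obs in pairs)
    return correct / total

def q3_per_protein(pairs):
    """Mean of the per-protein fractions of correctly predicted residues."""
    fractions = [sum(p == o for p, o in zip(pred, obs)) / len(obs)
                 for pred, obs in pairs]
    return sum(fractions) / len(fractions)

# Toy data: a short protein predicted perfectly and a longer one half right.
pairs = [("HHHH", "HHHH"), ("EECCEECC", "CCCCCCCC")]
```

On the toy data the per-residue score is 8/12 while the per-protein score is 0.75, because the per-protein average gives the short, well-predicted chain the same weight as the longer one.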



Table 2. First confusion matrix derived with an ensemble of 6 BRNNs with 2/3-1/3
data splitting. First row provides percentages of predicted helices, sheets, and coils
within (DSSP-assigned) helices.

       pred α    pred β    pred γ
α      78.61%     3.13%    18.26%
β       5.00%    61.49%    33.51%
γ      10.64%     9.37%    79.99%



Table 3. Same as above. First row provides percentages of (DSSP-assigned) helices,
sheets, and coils within the predicted helices.

          α         β         γ
pred α    80.77%     3.74%    15.49%
pred β     5.11%    73.17%    21.72%
pred γ    11.71%    15.63%    72.66%

To further compare our system with other predictors, as in (Cuff & Barton,
1999), we also trained an ensemble of BRNNs using the 126 sequences in the Rost
and Sander data set. The performance on the 396 test sequences prepared by Cuff
and Barton is Q3 = 72.0%. This is slightly better than the 71.9% score for the
single best predictor (PHD) amongst (DSC, PHD, NNSSP, and PREDATOR)
reported in (Cuff & Barton, 1999). This result is also achieved with the CASP
class assignment. Finally, we also trained an ensemble of 6 BRNNs using the
set containing 826 sequences with less than 25% identity to the 126 sequences
of Rost and Sander. When tested on the 126 sequences, the system achieves
Q3 = 74.7% per residue, with correlation coefficients $C_\alpha = 0.692$, $C_\beta = 0.571$,
and $C_\gamma = 0.544$. This is again achieved with the harder CASP assignment. In
contrast, the Q3 = 75.1% described above was obtained by 7-fold cross-validation
on 824 sequences and with the easier class assignment (H → α, E → β, the
rest → γ). The same experiment was performed using the larger training set of
1,180 sequences having also less than 25% identity with the 126 sequences of
Rost and Sander, but with a less stringent redundancy reduction requirement.
In this case, and with the same hard assignment, the results are Q3 = 75.3%
with correlation coefficients $C_\alpha = 0.704$, $C_\beta = 0.583$, and $C_\gamma = 0.550$. The
corresponding confusion matrices are given in Tables 2 and 3. Table 4 provides
a summary of the main results with different datasets.
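The two confusion matrices and the per-class correlation coefficients can all be derived from one table of residue counts. The sketch below assumes that `counts[o][p]` holds the number of residues with observed class o and predicted class p, and that the correlation coefficients are the Matthews correlation coefficients customary in this literature (the text does not define them explicitly). The counts themselves are made up for illustration.

```python
import math

def row_percentages(matrix):
    """Normalize each row so that it sums to 100."""
    return [[100.0 * v / sum(row) for v in row] for row in matrix]

def transpose(matrix):
    return [list(col) for col in zip(*matrix)]

def correlation(counts, c):
    """Matthews correlation coefficient of class c against all other classes."""
    total = sum(sum(row) for row in counts)
    tp = counts[c][c]
    fn = sum(counts[c]) - tp                  # observed c, predicted otherwise
    fp = sum(row[c] for row in counts) - tp   # predicted c, observed otherwise
    tn = total - tp - fn - fp
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical counts: rows are observed classes, columns are predicted classes.
counts = [[8, 1, 1], [1, 6, 3], [2, 2, 16]]
within_observed = row_percentages(counts)              # layout of Table 2
within_predicted = row_percentages(transpose(counts))  # layout of Table 3
```

The first normalization answers "how are residues of a given true class predicted", the second "what do residues with a given prediction really belong to", which is why the diagonal entries of the two tables differ.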



Table 4. Summary of main performance results.

Training sequences   Test sequences   Class assignment   Performance
824 (2/3)            824 (1/3)        Default            Q3 = 75.1%
824                  35               CASP               Q3 = 73.0%
126                  396              CASP               Q3 = 72.0%
826                  126              CASP               Q3 = 74.7%
1180                 126              CASP               Q3 = 75.3%



Although in principle bidirectional models can memorize all the past and
future information using the state variables $F_t$ and $B_t$, we also tried to employ a
window of amino acids as input at time t. In so doing, the input $U_t$ for the model
is a window of w amino acids centered around the t-th residue in the sequence (see
Fig. 4). As explained in the previous sections, both with BIOHMMs and BRNNs
the prediction $Y_t$ is produced by an MLP fed by $U_t$ (a window of amino acids) and
the state variables $F_t$ and $B_t$. Hence, compared to the basic architecture of Qian
and Sejnowski, our architecture is enriched with more contextual information
provided by the state variables. The main advantage of the present proposal is
that w can be kept quite small (even reduced to a single amino acid), and yet
relatively distant information is propagated through the state variables. Because
of stationarity (weight sharing) this approach allows a better control over the
number of free parameters, thus reducing the risk of data overfitting.
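As an illustration of the windowed input, the sketch below builds the input vector for one position as the concatenation of one-hot codes of the w residues centred on it. The 20-letter alphabet ordering, the all-zero padding beyond the chain ends, the restriction to odd w, and the function name are assumptions of this sketch. The encoding actually used (for example, profiles instead of one-hot codes) may differ.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def window_input(sequence, t, w):
    """Return the input for 0-based position t as w concatenated one-hot codes.

    w is assumed to be odd. Window positions that fall outside the sequence
    are encoded as all-zero vectors (an assumption of this sketch).
    """
    half = w // 2
    u = []
    for pos in range(t - half, t + half + 1):
        code = [0] * len(AMINO_ACIDS)
        if 0 <= pos < len(sequence):
            code[AMINO_ACIDS.index(sequence[pos])] = 1
        u.extend(code)
    return u
```

With w = 1 the input reduces to the code of a single residue (20 values), and all remaining context has to reach the output through the state variables.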

In a set of preliminary experiments, we have tried different architectures and 
model sizes. In the case of BIOHMMs, the best result was obtained using w = 11, 
n = m = 10, 20 hidden units for the output network and 6 hidden units for the 
forward and backward state transition networks. The resulting model has about 
10^ parameters in total. The correct residue prediction rate is 68%, measured by 
reserving 1/3 of the available sequences as a test set. This result was obtained 
without using output filtering or multiple alignments. Unfortunately, n = 10 
seems too small a number for storing enough contextual information. On the 
other hand, higher values of n are currently prohibitive for today’s computational 
resources since complexity scales up with n^. 

In the case of BRNNs, we were able to obtain slightly better performances, 
with significant computational savings. A set of initial experiments indicated 
that redefining the output function as $Y_t = \eta(F_{t-k}, \ldots, F_{t+k}, B_{t-k}, \ldots, B_{t+k}, U_t)$
and using w = 1 yields the best results. In subsequent experiments, we have
trained 4 different BRNNs with n = m varying from 7 to 9, and k varying 
from 2 to 4. The number of free parameters varies from about 1400 to 2100. An 
RNN can develop quite complex nonlinear dynamics and, as a result, n BRNN 
state units are able to store more context than n BIOHMM discrete states. The 
performances of the 4 networks are basically identical, achieving about 68.8% ac- 
curacy measured on the test set. While these results do not lead to an immediate 
improvement, it is interesting to remark that using a static MLP we obtained 
roughly the same accuracy only after the insertion of additional architectural 
design as in (Riis & Krogh, 1996): adaptive input encoding and output filtering. 
More precisely, the MLP has w = 13, with 5 units for adaptive encoding (a total 
of about 1800 weights) and achieves 68.9%. Interestingly, although the 4 BRNNs 
and the static MLP achieve roughly the same overall accuracy, distributions of 
errors on the three classes are quite different. This suggests that combining pre- 
dictions from filtered MLP and BRNNs could improve performance. Indeed, by 
constructing an ensemble with the five networks, accuracy increased to 69.5%. 
Finally we enriched the system using an output filtering network on the top of 
the ensemble and adding multiple alignment profiles as provided by the HSSP 
database (Schneider et al., 1997). In this preliminary version of the system, we
have not included commonly used features like entropy and number of insertions 
and deletions. The performance of the overall system is 73.3%. 

In a second set of experiments, we measured accuracy using 7-fold cross 
validation. The usage of more training data in each experiment seems to have a
positive effect. The performance of the five networks ensemble is 69.6% without 
alignments and 73.7% using alignments. We must remark that these results 
are not directly comparable with those reported by Rost and Sander (1994) 
because our dataset contains more proteins and the assignment of residues to 
SS categories is slightly different (in our case the class coil includes everything
except DSSP classes ’H’ and ’E’). 

The last experiment is based on a set of 35 proteins from the 1998 edition 
of “Critical Assessment of Protein Structure Prediction” (Moult et al., 1997;
CASP3, 1998). This unique experiment attempts to gauge the current state of 
the art in protein structure prediction by means of blind prediction. Sequences 
of a number of target proteins, which are in the process of being solved, are 
made available to predictors before the experimental structures are available. 
Although we tried our system only after the competition was closed, we believe 
that the results obtained on this dataset are still interesting. Our system achieved
71.78% correct residue prediction on the 35 sequences. A direct comparison with 
other systems is difficult. The best system (labeled JONES-2 in the CASP3 
web site) achieves 75.5% correct residue prediction on a subset of 23 proteins 
(performance of JONES-2 on the remaining 12 proteins is not available). It 
should also be remarked that, in the CASP evaluation system, DSSP class ’G’
(3-10 helix) is assigned to ’H’ and DSSP class ’B’ (beta bridge) is assigned to 
’E’. Moreover, accuracy is measured by averaging the correct prediction fraction 
over single proteins, thus biasing sensitivity towards shorter sequences. Using 
this convention, our accuracy is 74.1% on 35 proteins while JONES-2 achieves 
77.6% on 23 proteins. If we focus only on the 24 proteins for which our network 
has the highest prediction confidence (the criterion is based on the entropy at 
the softmax output layer of the network), then the performance of our system 
is 77.5%, although it is likely that in so doing we are including sequences which 
are easy to predict. More importantly, JONES-2 results have been obtained 
using profiles from TrEMBL database (Bairoch & Apweiler, 1999). These profiles 
contain many more sequences than our profiles which are based on the older 
HSSP database. We believe that this leaves room for further improvements. 
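The confidence-based selection mentioned above can be sketched as a ranking of proteins by the entropy of their softmax outputs. The text only states that the criterion is based on this entropy. Averaging the entropy over the residues of a protein and keeping the proteins with the lowest average is an assumption of this sketch, as are the function names and the toy data.

```python
import math

def mean_entropy(distributions):
    """Average Shannon entropy (in nats) of a list of probability vectors."""
    entropies = [-sum(p * math.log(p) for p in dist if p > 0)
                 for dist in distributions]
    return sum(entropies) / len(entropies)

def most_confident(proteins, k):
    """Return the names of the k proteins with the lowest mean output entropy."""
    ranked = sorted(proteins, key=lambda name: mean_entropy(proteins[name]))
    return ranked[:k]

# Hypothetical softmax outputs for three short proteins.
proteins = {
    "p1": [[1.0, 0.0, 0.0], [0.9, 0.05, 0.05]],
    "p2": [[1 / 3, 1 / 3, 1 / 3], [1 / 3, 1 / 3, 1 / 3]],
    "p3": [[0.5, 0.5, 0.0]],
}
```

Because low entropy tends to coincide with sequences that are easy to predict, accuracy measured on such a subset is likely to be optimistic, as the text itself cautions.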



7 Conclusion 

In this paper, we have proposed two novel architectures for dealing with sequence 
learning problems in which data is not obtained from physical measurements 
over time. The new architectures remove the causality assumption that
characterizes current connectionist approaches to learning sequential translations. Using
BRNNs on the protein secondary structure prediction task appears to be very 
promising. Our performance is very close to the best existing systems although 
our usage of profiles is not as sophisticated. One improvement of our prediction 
system could be obtained by using profiles from the TrEMBL database. 
1 Introduction

The prototypical use of “classical” connectionist models (including the multi- 
layer perceptron (MLP), the Hopfield network and the Kohonen self-organizing 
map) concerns static data processing. These classical models are not well suited 
to working with data varying over time. In response to this, temporal connectionist
models have appeared and constitute a continuously growing research field.
The purpose of this chapter is to present the main aspects of this research area 
and to review the key connectionist architectures that have been designed for 
solving temporal problems. 

The following section presents the fundamentals of temporal processing with 
neural networks. Several temporal connectionist models are then detailed in 
section 3. As a matter of illustration, important applications are reviewed in the 
third section. The chapter concludes with the presentation of a promising future 
issue: the extension of temporal processing to even more complex structured 
data. 

2 Fundamentals 

Before actually getting into the fundamentals of temporal connectionist models, 
we have to clarify some possible confusion between different kinds of time repre- 
sentation. 

The issue we are concerned with is the study of models which are able to take 
the “natural” time of a problem into account. Some ambiguity may arise from the
fact that some models also have an internal use of time. This internal time does 
not however correspond to any temporal dimension of the problem dealt with; the 
problem considered by such a model could even be static (e.g. image recognition). 
Internal use of time is only necessary for the model’s own dynamics (e.g. for 
relaxation of inner states to some equilibrium as in the Hopfield network).

To make the difference clear, we shall speak of external time for the time 
of the problem which is considered. Some sort of mixed computation could of 
course happen where the use of internal time is also extended to the processing of 
the input sequence. To the best of the authors’ knowledge however, no such link 
between internal and external time has ever been investigated in the literature. 

The chapter focuses only on neural network architectures that handle the 
temporal problem, i.e. that deal with external time. 
Several aspects of time may be involved in temporal problems. These aspects
correspond to different properties of time such as: 

- time as a simple order relation (where time only consists of an index used
  to order events, e.g. the time embedded in the reading of a sentence);

- time as metrics (e.g. the time involved in speech, where duration is
  meaningful);

- discrete time versus continuous time;

- time over a finite versus infinite interval (never ending problems).



2.1 A Short Historical Overview 

The chronological ordering of temporal connectionist models is characterized by 
the increase of the integration of time in the architecture. 

A first phase in the development of temporal neural networks is characterized 
by architectures based on classical models, but locally modified so as to take 
the time dimension into account. For instance, recurrent networks based on the 
MLP were introduced. They include backward links with a one time step delay 
that represents the ordering relation between two successive inputs. This kind 
of approach typically focuses on temporal sequence learning. 

For example, (Jordan, 1986) designed such an architecture in which the
output vector at time t - 1 is concatenated to the input vector at time t [1]. This
architecture was used for modelling the sequence of movements of a robot arm.

(Elman, 1990) also designed a similar architecture, in which the hidden vec- 
tor rather than the output vector is concatenated to the input vector. This 
architecture was successfully used for sentence processing. 
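The difference between the two architectures lies only in which vector is fed back. The sketch below runs a one-hidden-layer network over a sequence and feeds back either the hidden vector (Elman-style) or the output vector (Jordan-style). Keeping the feedback weights in a separate matrix `W_ctx` is equivalent to concatenating the fed-back vector to the input. The tanh hidden units, the linear output, the zero initial context, and all names and weight values are assumptions made for illustration.

```python
import math

def step(x, context, W_in, W_ctx, W_out):
    """One time step: hidden = tanh(W_in x + W_ctx context), output = W_out hidden."""
    hidden = [math.tanh(sum(w * v for w, v in zip(W_in[h], x))
                        + sum(w * c for w, c in zip(W_ctx[h], context)))
              for h in range(len(W_in))]
    output = [sum(w * hv for w, hv in zip(W_out[o], hidden))
              for o in range(len(W_out))]
    return hidden, output

def run(sequence, W_in, W_ctx, W_out, feedback):
    """feedback='hidden' feeds back the hidden vector, 'output' the output vector.

    W_ctx must have one column per fed-back unit; the context starts at zero.
    """
    context = [0.0] * len(W_ctx[0])
    outputs = []
    for x in sequence:
        hidden, output = step(x, context, W_in, W_ctx, W_out)
        outputs.append(output)
        context = hidden if feedback == "hidden" else output
    return outputs

# Toy network with one input, one hidden unit, and one output.
W_in, W_ctx, W_out = [[1.0]], [[1.0]], [[2.0]]
seq = [[0.5], [0.0]]
elman = run(seq, W_in, W_ctx, W_out, "hidden")
jordan = run(seq, W_in, W_ctx, W_out, "output")
```

Both variants agree on the first step, since the context is still zero, and diverge from the second step on.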

These modifications of the architecture also affect the learning process: the 
backward links have to be taken into account. The learning algorithm used in this 
case, called “back-propagation through time”, is a generalisation of the standard 
back-propagation learning algorithm (see for instance the paper of (Williams & 
Peng, 1990)). 

A second phase in the development of temporal neural networks still
considered classical architectures but focused on configuring their parameters more
globally so as to be able to accommodate enough temporal information for
solving the problem. A first important parameter which can take into account the
time dimension is the input vector: this could be a temporal window over the
input signal for example. This choice triggers other choices such as the dimension
of the hidden layers which needs to be proportional to the dimension of the input
temporal window. A good prototype from this second phase is the TDNN (Lang
et al., 1990) [2]. This kind of architecture typically focuses on temporal pattern
recognition of the kind needed in speech processing.

[1] See section 3.2 for further details on this architecture.

[2] This model is described in section 3.1.

After the use of adapted classical architectures, a third phase began:
architectures specially designed for time processing appeared. Most
of them are inspired by our knowledge of information processing in “natural”
neural networks. These try to mimic one or several of their characteristics such
as:

1. propagation delays on connections; 

2. neurons sending discrete pulses rather than continuous activity (i.e. “spiking
neurons”);

3. neuron activity being also a function of time; for instance introducing some 
refractory period; 

4. synchronization between neuron populations. 

For instance, RST (Chappelier & Grumbach, 1998), which has been applied
to finding the moving part of an image, uses properties 2 and 3 above [3].

Considering architectures using spiking neurons, (Maass, 1997b) studied their 
computational power and showed that they have at least the same computational 
power as a classical MLP, but need fewer neurons. 

Although several other architectures have also been designed with this kind 
of approach, we are still only at the beginning of this third phase. 

2.2 Time Integration in Connectionist Models 

Let us now detail the different approaches of time integration in connectionist 
models such as sketched out in the previous section. The hierarchy of models 
detailed hereafter is summarized in figure 1 (see also the paper of (Chappelier 
& Grumbach, 1994)). 




Fig. 1. A classification of connectionist models with respect to time integration. 



[3] A description of this architecture is given in section 3.4.

External Representation of Time. The first approach in the integration
of time into connectionist models consists in not introducing it directly in the 
architecture, rather leaving the time representation outside the neural network. 
The idea is to preprocess the data so that classical static connectionist models 
can proceed with the temporal task. Time is preprocessed through a time to 
space transformation; the network accessing then only spatial information, a 
dimension of which has semantics related to time^. 

Researchers who have taken this approach include (Simpson & Deich, 1988),
(Gorman & Sejnowski, 1988), (Bengio et al., 1989) and (Goldberg & Pearlmutter,
1989). One typical model of this category is the well-known TDNN (Lang et al.,
1990), which is described in section 3.1.

(Elman, 1990) points out that there are several disadvantages with this kind 
of approach. First, it requires some buffering: how long should the buffer be 
and how often does the network look at it? This approach imposes therefore a 
rigid limit on the duration of patterns, which is not necessarily appropriate to 
the real temporal input. Secondly, no difference is made between relative and 
absolute temporal positions. Most of these architectures do not even have any 
representation of absolute temporal position at all. Finally, these methods also 
suffer from inadequate/inflexible time windowing and from over-training caused
by an excessive number of weights, even if some clever strategies are used to 
share weights over time (as in TDNN). 
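
The time-to-space transformation discussed above can be sketched in a few lines. This is only an illustration: the function name `time_to_space` and the window length `k` are assumptions made here, not names from the cited works. It also makes visible the rigid buffer length that (Elman, 1990) criticizes: any dependency longer than `k` steps is simply invisible to the static network.

```python
def time_to_space(sequence, k):
    """Turn a sequence into fixed-size windows of k consecutive values.

    Each window is a purely spatial vector that a static network could take
    as input; the position inside the window carries the time semantics.
    """
    if k < 1:
        raise ValueError("window length k must be at least 1")
    return [list(sequence[i:i + k]) for i in range(len(sequence) - k + 1)]


windows = time_to_space([1, 2, 3, 4, 5], k=3)
print(windows)  # [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
```

A sequence shorter than `k` yields no window at all, which is one concrete form of the buffering problem mentioned above.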



Time as an Internal Index. Time itself can be introduced into connectionist 
models at several levels. First of all, time can be used as an index in a sequence of 
network states. There is no actual representation of time in the network strictly 
speaking but rather a use of time as an internal variable controlling the inner 
mechanism. We may say in this case that time is implicitly present in the model. 
This kind of network is typically illustrated by recurrent networks (RNNs, de- 
scribed further in section 3.2). As explained later, these models are nevertheless 
very powerful for temporal processing of sequences. 



Time at the Connection Level. A step further in the introduction of time 
in a neural model is to represent it explicitly at the level of the network either 
on connections or at the neuron level (or both). 

In the case where time is represented at the level of connections, it is usually 
done by some delays of propagation on the connections (“temporal weights”), 
or more generally by some convolution of the neuron input by a given tempo- 
ral kernel. The works of (Beroule, 1987), (Jacquemin, 1994) and (Amit, 1988) 
are, among others, three different but representative approaches of connections 
carrying temporal information. 

Problems that are tackled with such architectures typically involve temporal 
matching between events. 

Note (time-to-space transformation): for instance, one effective way of constructing
a spatial representation of temporally occurring information (which is not limited to
neural computing) is to compute the power spectrum of the incoming information and
use it as a static input image.
Time at the Neuron Level. At the level of the neuron itself, there are several
ways of introducing time, depending on how temporal information will be
represented by the neural network activity. The two main approaches are:

— the information is contained in the time sequence of the neuron activities; 

— the neuron activity consists of discrete events (“pulses” or “spikes”), the 
information being conveyed by the timings of these events. 

These two points of view are summarized in figure 2. 



Fig. 2. An illustration of the two different points of view on temporal coding at the
neuron level: (a) variations (in time) of the amplitude of the neuron signal; (b) different
timings of neural events.



Static neural networks can clearly be emulated by the first type of temporal 
neural networks as they constitute a very special case (no time). It is furthermore
interesting to notice that “static” neural networks can also be emulated with 
infinite precision in the second framework (Maass, 1997a, 1997b), the activity of 
a given static neural network being transcoded into a set of timings of a spiking 
neural network. Networks of spiking neurons therefore have at least the same 
computational power as classical neural networks. 

The introduction of time at the neuron level can be done either by simulating 
biological properties or by building up neuron models from an engineering point 
of view, introducing time without specific biological inspiration. 

The first approach usually leads to neuron models based on differential equations
(Rinzel & Ermentrout, 1989; Abbott & Kepler, 1990). The model most
often used in this context is the so-called “integrate and fire” model, or
“leaky integrator”. The principle underlying these models consists in summing
the inputs of the neuron over a period of time. When this sum becomes greater
than a given threshold (specific to each neuron), the neuron state changes. For
a survey of this kind of model, we refer to the paper of (Gerstner, 1995).
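
As a minimal discrete-time sketch of this principle (not the formulation of any of the cited papers), the following function accumulates its input with a leak and emits a spike when the accumulated value reaches the threshold. The parameter names `threshold`, `leak` and `reset`, and the choice of resetting the potential after a spike, are assumptions made for illustration.

```python
def leaky_integrate_and_fire(inputs, threshold=1.0, leak=0.9, reset=0.0):
    """Return the list of time steps at which the neuron fires.

    At each step: potential <- leak * potential + input.
    When the potential reaches the threshold, the neuron fires and its
    potential is set back to `reset`.
    """
    potential = 0.0
    spikes = []
    for t, x in enumerate(inputs):
        potential = leak * potential + x
        if potential >= threshold:
            spikes.append(t)
            potential = reset
    return spikes


# A constant sub-threshold input only triggers a spike once enough of it
# has been accumulated.
print(leaky_integrate_and_fire([0.5] * 6, threshold=0.9, leak=0.5))
```

With `leak=1.0` the neuron is a perfect integrator; with smaller values, old inputs are progressively forgotten.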

On the other hand, the engineering approach is often purely algebraic: 

— by changing the usual representation (scalar) to complex numbers (Vaucher, 
1996), 

— or by introducing an artificial time varying bias (Horn & Usher, 1991), 
— or by considering that the standard equation of neural networks,
$U_j = f(\sum_i W_{ij} X_i)$, concerns equilibrium and could be generalized to a
dynamical equation having it as its steady state, which is finally equivalent to
an “integrate and fire” model.
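
One common way to write such a generalization, given here as an assumption rather than as the authors' exact formula, is a first-order relaxation towards the static value. The time constant $\tau$ is a symbol introduced for this sketch; $U_j$, $W_{ij}$ and $X_i$ denote the output of neuron j, its incoming weights and its inputs.

```latex
\tau \frac{dU_j}{dt} = -U_j + f\Big(\sum_i W_{ij} X_i\Big),
\qquad
\frac{dU_j}{dt} = 0 \;\Longrightarrow\; U_j = f\Big(\sum_i W_{ij} X_i\Big).
```

At equilibrium the time derivative vanishes and the usual static equation is recovered, which is the sense in which the static equation "concerns equilibrium".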

The introduction of time at the neuron level often leads to dynamical properties
and complex behaviours involving oscillations (Horn & Usher, 1991) and
synchronizations at the network level (Hirch, 1991; Ramacher, 1993; Turner &
Huberman, 1992).

2.3 Temporal Components of Connectionist Models 

Having detailed the different approaches to the time integration in connectionist 
models, we are now able to detail the different temporal components used in 
these models. 

Any connectionist architecture that processes temporal patterns contains two
(at least conceptually) distinct components: a short-term memory and a
predictor.

The short term memory has to retain those aspects of the input sequence 
that are relevant for the problem. The predictor, on the other hand, uses the 
content of the short term memory to predict/classify and produce some output.

These two modules can either be embedded in each other or be as explicitly 
distinct as, for instance, in the model of (Catfolis, 1994), which consists of an
RNN followed by an MLP. When explicit, the predictor will most generally be a
classical static connectionist architecture. 

Concerning the short term memory, several types are considered which can
be classified along three axes: memory form, memory content and memory
plasticity. These concepts were introduced rather informally by (Mozer, 1994). In
order to formalize them a little further, we define a memory as some function $f$
of time and previous inputs: $f(t, x(t-1), \ldots, x(t-k))$.

Memory form 

The memory form deals with the function $f$ itself. It can be as simple as a
buffer containing the k most recent inputs. This kind of memory is used
within the spatial approach to time representation.

But the memory form does not necessarily have to consist of the raw input
sequence itself. It could include some transformation of the representation
of the input. Such types of memory form include decaying neuronal activity
or delayed connections performing some convolution of the neuron input
signal.
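
To make the "decaying neuronal activity" form concrete, here is a minimal sketch of an exponentially decaying memory trace. The update rule and the name `decay` are assumptions chosen for illustration; the cited works use various parameterizations.

```python
def exponential_trace(inputs, decay=0.5):
    """Memory m(t) = decay * m(t-1) + (1 - decay) * x(t), returned for every t.

    The single number m(t) summarizes the whole past of the input, with
    older inputs weighted less and less.
    """
    memory = 0.0
    trace = []
    for x in inputs:
        memory = decay * memory + (1.0 - decay) * x
        trace.append(memory)
    return trace


# A single pulse fades away geometrically.
print(exponential_trace([1, 0, 0], decay=0.5))  # [0.5, 0.25, 0.125]
```

Unrolling the recursion shows that this trace is the convolution of the input with the kernel (1 - decay) * decay**i, so it is one simple instance of the convolution form mentioned above, in contrast with a buffer that keeps the k most recent inputs unchanged.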

Note (short-term memory and predictor): several other authors make this distinction,
including (Mozer, 1994) and (de Vries & Principe, 1992).

Note (buffer memory): in this case the function $f$ returns a vector of size k
consisting of $x(t-1), \ldots, x(t-k)$.

Note (convolution memory): for instance $f = K(t) \otimes x(t)$ for some function $K$
being the “kernel” of the memory.
Memory content

The memory content is related to the number of arguments $f$ actually takes
into account. This can be related to the Markovian/non-Markovian aspect
of the problem. “Markovian” means that the output at any time step can
be determined uniquely from the input and target values for a given number
of time steps in the recent past. It is the purpose of the memory to keep
these past values, either directly or combined with previous memory/output
states.

Notice that Markovian problems of order higher than 1 can always be
transformed into a Markovian problem of order 1. In the context of neural
networks, this transformation leads to some “delay lines” where the input at a
given time step is augmented with copies of k past inputs. However, if k is
large, the network is likely to be overwhelmed by large amounts of redundant
and irrelevant information. A standard technique for reducing the number
of weights is so-called “weight sharing” or “weight tying”, constraining a
given set of weights to have the same (unspecified) numerical value. Sharing
of weights is linked to some known symmetries of the problem. However, it
should be emphasized that using RNNs rather than delay lines keeps down
the number of parameters to be learned.
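
A rough weight count illustrates the last remark. The two helper functions below are hypothetical and deliberately simplistic: they ignore biases and weight sharing, and they compare only the first layer of a delay-line network with a single simple recurrent layer.

```python
def delay_line_weights(n_inputs, n_hidden, k):
    """Input-to-hidden weights when the input is augmented with k past copies."""
    return (k + 1) * n_inputs * n_hidden


def recurrent_weights(n_inputs, n_hidden):
    """Input-to-hidden plus hidden-to-hidden weights of a simple recurrent layer."""
    return n_inputs * n_hidden + n_hidden * n_hidden


# Illustrative sizes only: 16 input features, 10 hidden units, 20 past copies.
print(delay_line_weights(16, 10, 20))  # 3360
print(recurrent_weights(16, 10))       # 260
```

The delay-line count grows linearly with k, whereas the recurrent count does not depend on how far back the relevant information lies.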

Non-Markovian problems, in which outputs are dependent on inputs that are 
an unbounded distance in the past, require the memory content to depend 
on some internal states.

Memory plasticity 

Memory plasticity focuses on how the memory evolves through time, which
can formally be defined as 

The memory can either be static, when all its parameters are fixed in
advance, or adaptive. Adaptive memory consists in learning either delays
(Bodenhausen & Waibel, 1991; Unnikrishnan et al., 1991), decay rates
(Mozer, 1989; Frasconi et al., 1992) or some other parameters characterizing the
memory (de Vries & Principe, 1992).

Static memories are mentioned here since they still can be of some interest 
when there is adequate domain knowledge to constrain the type of informa- 
tion that should be represented in the memory. For instance, the memory 
may have a high resolution for recent events and decreasing resolution for 
more distant ones as illustrated by the work of (Tank & Hopfield, 1987). 



2.4 The Computational Power of Temporal Connectionist Models 

On the basis of the classification made in the previous sections, this section 
addresses the question of the kind of temporal problems connectionist models 
can actually handle. 

Note (Markovian problems): i.e. the output depends only on a temporal neighbourhood
of the input.

Note (weight sharing): for an illustration of that point, see the TDNN explained in
the next section.

Note (internal states): these are not made explicit in the chosen presentation of the
memory form $f$ but are rather “hidden” in the dependency of $f$ on t (its first argument).

Note (memory evolving through time): i.e. “adapting” or “learning”.
Standard (Static) Networks. It is widely known that classical neural
networks, even with one single hidden layer, are universal function approximators
(Hornik et al., 1989; Hornik, 1991). This means that any continuous function
with compact domain and compact range can be approximated to an arbitrary
degree of precision, with regard to the norm of uniform convergence, by a
network of this type (provided that it has enough hidden neurons). This
universality theorem provides a theoretical framework for the application of MLPs to
various problem domains and explains their success.

Such feedforward networks, inherently static, can nevertheless have some
applications in the temporal domain, as already explained in the last two sections. The
simplest temporal problems can be handled using feedforward networks.
Indeed, MLPs are enough to learn mappings where the input data at each
time step contains enough information to determine the output at that time
(i.e. where $o(t) = F(i(t))$, where i is the input and o the output of the network).

It can furthermore be proved that such static models implement some non-linear
generalization of usual statistical AR-predictors; i.e. $o(t) = F(i(t), \ldots, i(t-k))$
(Lapedes & Farber, 1987).



Recurrent Networks. Similarly to the universality of static neural networks
in function approximation, recurrent neural networks (RNNs) are universal
approximators of dynamical systems (Funahashi & Nakamura, 1993). It is also
easy to show that RNNs can simulate any arbitrary finite state automaton (FSA)
(Cleeremans et al., 1989). (Siegelmann & Sontag, 1995) have even shown that an
RNN can simulate any Turing Machine.
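
As a small illustration of the finite-state claim, the sketch below hand-wires a recurrent network of threshold units that tracks the parity of a bit stream, i.e. a two-state automaton. The wiring and the function names are chosen here for illustration; it is not the construction used in the cited papers, and nothing is learned.

```python
def step(x):
    """Heaviside threshold unit."""
    return 1 if x > 0 else 0


def parity_rnn(bits):
    """Hand-wired recurrent threshold network tracking the parity of its input.

    The recurrent state s plays the role of the automaton state
    (0 = even number of ones seen so far, 1 = odd).
    """
    s = 0
    states = []
    for u in bits:
        h_or = step(s + u - 0.5)      # fires if s or u is on
        h_and = step(s + u - 1.5)     # fires only if both are on
        s = step(h_or - h_and - 0.5)  # XOR of previous state and input
        states.append(s)
    return states


print(parity_rnn([1, 0, 1, 1]))  # [1, 1, 0, 1]
```

The state fed back at each step is the only memory of the past, which is exactly what makes the network behave like an automaton rather than like a static mapping.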

In the case where the previous output is needed as well as the current input to
determine the output, i.e. if $o(t) = F(i(t), o(t-1))$, simple RNNs with feedback
weighted connections from the ordinary target nodes (such as Jordan networks)
or even feedforward networks with “teacher forcing” techniques could be employed
(Rohwer, 1994).

In the most general case, however, the approximation theorems mentioned
above are only theoretical results: they do not say anything about how
a given machine should be approximated. Unfortunately, these theorems do not
imply that RNNs can easily be trained from examples to perform complex temporal
tasks.

This is the reason why relatively few cases have been reported showing the
successful learning of complex dynamics by fully connected RNNs. One of the
reasons lies in the cost of computing the error gradient for large scale
RNNs. High-order Markovian problems present severe difficulties to temporal
neural networks. Although some preliminary solutions have been proposed
(Bengio et al., 1993; Hochreiter & Schmidhuber, 1997b), the computation of long time
dependencies still remains a major problem for RNNs. For a good overview and
critique on this point, we refer to the contribution of (Hochreiter & Schmidhuber,
1997a).
Note (classical neural networks): more precisely, MLPs.
Spiking Neural Networks. It can be demonstrated (Maass, 1994, 1996) that
a “spiking neural network” is at least as powerful as a Turing machine. This 
means that such an architecture is, with a finite number of neurons and with 
a boolean input, able to simulate in real-time any Turing machine with a finite 
number of tapes. 

Furthermore, spiking neural networks are strictly more powerful than Turing
machines. Indeed, they can also simulate some machines that a Turing machine
cannot, since spiking neural networks are able to handle computation with real
numbers (differences of temporal events).

It is then easy to claim that spiking neural networks are unlimited with 
respect to temporal computing. However, the problem remains that, as with 
RNNs, such a powerful theorem does unfortunately not tell us how a given 
function can be learned by such networks. Still, the implementation of several 
fundamental functions (such as multiplication, addition, comparison) has been 
detailed by (Maass, 1996). 

Summary. As far as estimation is concerned, a parallel can be drawn between
usual statistical estimators, which are inherently linear, and the non-linear estimators
resulting from temporal neural networks. The non-linear equivalent of the
autoregressive (AR) estimator is the “standard” MLP, whereas the non-linear equivalents
of autoregressive with moving average (ARMA) estimators are represented by
RNNs (Connor et al., 1992; Connor & Martin, 1994). Temporal connectionist
models therefore appear to be a richer and more powerful family of estimators
than the usual linear ones.

A summary of all the temporal potentialities of connectionist models is given 
in table 1. 

Table 1. A summary of theoretical properties of temporal connectionist models.

connectionist model        able to emulate                     non-linear generalization of estimator
standard static networks   continuous mapping                  AR
RNNs                       dynamical system / Turing machine   ARMA
spiking neural networks    Turing machine                      any?



3 Most Relevant Temporal Connectionist Models 

The aim of this section is to present in detail some temporal connectionist models
in order to illustrate more concretely the theoretical issues presented so far.

Note (spiking neural network): i.e. a connectionist architecture built on spiking
neurons and having some propagation delay on the connections.
3.1 TDNN

The “Time-Delay Neural Network” (TDNN) model is a modification of the MLP 
architecture, the input of which consists of a “delay line”, i.e. a whole set of 
“time-slices” as illustrated in figure 3. This set of time slices is shifted from left 
to right at each time step. The hidden layer is also divided into sub-slices (which 
are not shifted but computed from the previous layer). In order to keep the time 
consistency, the weights are set to be equal among time slices. In that sense, 
time slices “share” a unique weight set (one for each layer). For instance the
weight of the connection from the first neuron of a time slice of the input layer 
to the first corresponding neuron of the hidden layer is always the same among 
all time slices of the input layer (see figure 3). 
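
The weight sharing described above can be sketched as follows: one and the same weight set is applied to every window of consecutive time slices. The function name `tdnn_layer`, the window width implied by `weights` and the absence of a non-linearity are simplifying assumptions made for this illustration, not details of the original TDNN.

```python
def tdnn_layer(frames, weights, bias=0.0):
    """Apply one shared weight set to every window of consecutive frames.

    frames  : list of feature vectors, one per time step
    weights : list of `width` weight vectors, one per position in the window
    Returns one activation per window position (no non-linearity, for brevity).
    """
    width = len(weights)
    outputs = []
    for start in range(len(frames) - width + 1):
        total = bias
        for offset in range(width):
            frame = frames[start + offset]
            total += sum(w * x for w, x in zip(weights[offset], frame))
        outputs.append(total)
    return outputs


frames = [[1, 0], [0, 1], [1, 1]]   # three time slices of two features each
weights = [[1, 2], [3, 4]]          # a window spanning two time slices
print(tdnn_layer(frames, weights))  # [5.0, 9.0]
```

Because the same `weights` are reused at every window position, the number of parameters depends on the window width and not on the length of the delay line.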



Fig. 3. TDNN architecture: an MLP architecture modified so that units take their inputs
from only a part of the previous layer corresponding to a time slice. Different time slices
are represented with different line styles. All time slices of a given layer share the same
weights.



This model was introduced by (Lang et al., 1990) in a speech recognition
context. The problem considered there was to recognize four kinds of phoneme.
The input of the TDNN consisted of a sequence of spectrogram “slices”.

3.2 Recurrent Networks 

Recurrent neural networks (RNNs) constitute the major family of temporal
connectionist models. An RNN can be defined in the most general way as a neural
network containing at least one neuron whose state depends either directly
or indirectly on at least one of its previous states. Formally, an RNN can be
described by an equation like:

$X_t = f(X_{t-1}, U_t, t, \Theta_f)$

$Y_t = g(X_t, t, \Theta_g)$          (1)

where U stands for the input signal sequence, Y for the output sequence and
the $\Theta$s for the parameters of the network (typically the weights). X represents the
set of recurrent variables.

A huge number of RNNs have been proposed by various
groups (Jordan, 1986; Williams & Zipser, 1989; Narendra & Parthasarathy, 1990;
Elman, 1990; Back & Tsoi, 1991; de Vries & Principe, 1992). Some of these
architectures do not bear much resemblance (at least superficially) to one another.
There have therefore been many attempts to find unifying themes in this variety of
architectures (Narendra & Parthasarathy, 1990; Nerrand et al., 1993; Tsoi &
Back, 1997; Tsoi, 1998).

By way of illustration, we now detail three examples of RNNs.



Jordan and Elman Networks. The (Jordan, 1986) and (Elman, 1990)
architectures are RNNs both based on the MLP architecture. They add
recurrent links from one part of the network to the input layer, either from the
output layer ((Jordan, 1986), figure 4a) or from the hidden layer ((Elman, 1990),
figure 4b). The state of the recurrent part at time $t-1$ is copied into the input layer as a
complement to the actual input vector at time t (i.e. the signal to be processed).
When computing a new output, the information flows downstream as in a
classical MLP, from the input to the output layer. The complementary input vector,
which is copied after each computation step, thus represents context information
about the past computation.
Fig. 4. Two typical recurrent neural networks: (a) Jordan’s and (b) Elman’s
architectures.



Mathematically speaking, Elman’s network is described by the following
simplification of equation 1 ($X_t$ and $Y_t$ do not depend directly on t, and the
$\Theta$s do not depend on t at all):

$X_t = f(X_{t-1}, U_t, \Theta_f)$

$Y_t = g(X_t, \Theta_g)$

where U is the actual input vector sequence, X is the vector sequence of neuron
states from the hidden layer and Y is the vector sequence of states from the
output layer (see figure 4b). $\Theta_f$ represents the set of weights from the input
layer to the hidden layer, $\Theta_g$ the set of weights from the hidden layer to the
output layer, and the $f$ and $g$ functions represent a vectorial form of the usual
sigmoidal function. Notice that the input layer of the network is made up of the
concatenation of $U_t$ and $X_{t-1}$.
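
The two equations of Elman's network translate almost literally into a forward pass. The sketch below is only a forward computation with given weights (no learning); the function names, the weight layout (one row per unit) and the zero initial context are assumptions made for illustration.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def layer(weights, vector):
    """Sigmoid layer: one weight row per output unit."""
    return [sigmoid(sum(w * x for w, x in zip(row, vector))) for row in weights]


def elman_run(inputs, w_hidden, w_output):
    """Run an Elman network over a sequence of input vectors.

    w_hidden maps the concatenation [U_t, X_{t-1}] to the hidden state X_t;
    w_output maps X_t to the output Y_t.
    """
    n_hidden = len(w_hidden)
    x = [0.0] * n_hidden                    # initial context (an assumption)
    outputs = []
    for u in inputs:
        x = layer(w_hidden, list(u) + x)    # X_t = f(X_{t-1}, U_t, Theta_f)
        outputs.append(layer(w_output, x))  # Y_t = g(X_t, Theta_g)
    return outputs


# One input, one hidden unit that only listens to its own previous state:
# the output changes over time although the input stays the same.
print(elman_run([[0.0], [0.0]], w_hidden=[[0.0, 1.0]], w_output=[[1.0]]))
```

The only place where the past enters the computation is the context part of the input layer, `x`, which is overwritten with the new hidden state after each step.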

Similarly, Jordan’s architecture can also be expressed in terms of another
simplification of equation 1 as:

$X_t = f(X_{t-1}, U_t, \Theta_f)$

where X is the vector sequence of states from the output layer (as well as Y, for
notation compatibility purposes). The function $f$ however needs to be developed
further. It results from the combination of the hidden and the output layer
computations:

$f(X_{t-1}, U_t, \Theta_f) = f_2(f_1(X_{t-1}, U_t, \Theta^{(1)}), \Theta^{(2)})$

with $f_1$ representing the computation of hidden layer states from input layer
states, $f_2$ the computation of output layer states from hidden layer states and
$\Theta_f = (\Theta^{(1)}, \Theta^{(2)})$.
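
The composition of the hidden and output layer computations in Jordan's architecture can be sketched in the same spirit. Again this is a forward pass only, with hypothetical names (`theta1`, `theta2`, `jordan_run`) and a zero initial state assumed for illustration.

```python
import math


def sigmoid_layer(weights, vector):
    """Sigmoid layer: one weight row per output unit."""
    return [1.0 / (1.0 + math.exp(-sum(w * x for w, x in zip(row, vector))))
            for row in weights]


def jordan_run(inputs, theta1, theta2):
    """Run a Jordan network: the previous output is fed back to the input layer.

    theta1 maps [U_t, X_{t-1}] to the hidden layer (the f_1 computation);
    theta2 maps the hidden layer to the output X_t (the f_2 computation).
    """
    n_out = len(theta2)
    x = [0.0] * n_out                                # initial state (an assumption)
    outputs = []
    for u in inputs:
        hidden = sigmoid_layer(theta1, list(u) + x)  # f_1
        x = sigmoid_layer(theta2, hidden)            # f_2, giving X_t (= Y_t)
        outputs.append(x)
    return outputs


print(jordan_run([[0.0], [0.0]], theta1=[[0.0, 1.0]], theta2=[[1.0]]))
```

Compared with an Elman network, the recurrent variable here is the output itself rather than the hidden state, so the context carries only what the network has actually produced.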

Jordan and Elman networks are typically used for memorizing (and recalling)
sequences such as poems (sequences of words), robot arm movements (sequences
of positions), etc.



δ-NARMA. The δ-NARMA neural network model (Bonnet et al., 1997c, 1997b)
was designed for signal prediction. Let us consider a temporal information
sequence, for instance the daily number of railroad travellers from Paris to Lyon.
Assume we know this number from three years ago up until yesterday. How can
we forecast the number of travellers of today? Such problems can be tackled
with usual statistical methods, such as ARMA models. But these methods have
at least two important drawbacks: they cannot take into account non-stationary
information and they are unable to deal with non-linear temporal relationships.
These limitations are the major reasons why the neural network approach has
been investigated. The problem then becomes: how to design a connectionist
architecture which is devoted to temporal forecasting?

D. Bonnet answered this question by designing the δ-NARMA neural
network. This network takes as inputs the values of the variable from date $t-p$
to date $t-1$ ($p > 2$). The output is the predicted value of the variable for date
t. The architecture has two levels: the ε-NARMA neuron and the δ-NARMA
network.

An ε-NARMA neuron is a recurrent neuron. But, instead of feeding the
output value back into the input vector, it feeds the errors from $t-q$ to $t-1$ (i.e.
the differences between the predicted output values and the real output values)
back into the input vector (see figure 5). In all other ways, it acts like a standard
neuron.



Fig. 5. An ε-NARMA neuron.



There are two main differences with the neuron model of (Frasconi et al.,
1992): the recurring information, which is the neuron output error in the
ε-NARMA architecture whereas it is the neuron output itself in Frasconi’s model,
and the number of values which are fed back: only the last one in Frasconi’s
architecture and the last q values in an ε-NARMA neuron.

A δ-NARMA network consists of an MLP network with ε-NARMA neurons.
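
The error-feedback idea can be sketched with a single neuron. This is a loose illustration under stated assumptions, not the model of the cited papers: the function name `narma_neuron_forecast` is hypothetical, the weights are given rather than learned, the activation is linear for readability, and the error buffer starts at zero.

```python
from collections import deque


def narma_neuron_forecast(series, w_values, w_errors, bias=0.0):
    """One-step-ahead forecasts from a single error-feedback neuron.

    The neuron sees the p previous values and its own q previous errors
    (prediction minus observed value). A linear activation is used here
    for readability; this is an assumption of the sketch.
    """
    p, q = len(w_values), len(w_errors)
    errors = deque([0.0] * q, maxlen=q)   # most recent error first
    forecasts = []
    for t in range(p, len(series)):
        past = series[t - p:t][::-1]      # most recent value first
        prediction = (bias
                      + sum(w * v for w, v in zip(w_values, past))
                      + sum(w * e for w, e in zip(w_errors, errors)))
        forecasts.append(prediction)
        errors.appendleft(prediction - series[t])  # feed the new error back
    return forecasts


# Illustrative weights only: three past values and two past errors.
series = [10.0, 12.0, 11.0, 13.0, 12.0, 14.0]
print(narma_neuron_forecast(series, [0.5, 0.3, 0.2], [-0.4, -0.1]))
```

With a negative weight on the last error, a forecast that was too low at the previous step is pushed upwards at the next one, which is the corrective role of the recurrent error connections.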

The learning algorithm is an adaptation of classical stochastic
backpropagation.

In such a forecasting problem, the time dimension is twofold: an order
relation, and a phenomenon which is captured at periodical time points (day or
month). These characteristics are grounded in the architecture through two
features: the input vector, which takes into account the previous values of the
forecast variable, and recurrent connections, which mean that the forecast value
depends on the previous errors.

This architecture has been successfully applied to railroad traffic prediction 
where it gave better results than usual statistical methods (Bonnet et al., 1997a).

3.3 Temporal Extensions of Kohonen Maps 

There have been several attempts at integrating temporal information into
Self-Organizing Feature Maps.

As for other classical connectionist models, a first technique consists of adding
temporal information externally, on the input of the map. For example,
exponential averaging of inputs and delay lines were considered (Kangas, 1990; 
Kohonen, 1991). 

Another common method is to use layered maps so that the second map tries 
to capture the spatial dynamics of the input moving on the first map (Kangas, 
1990; Morasso, 1991). 
Note: Self-Organizing Feature Maps are also called “Kohonen Maps”.
A third approach that has been investigated consists of integrating memory
into the map, typically with some exponential decay of activities (e.g. by using 
leaky-integrator neurons) (Privitera & Morasso, 1993; Chappell & Taylor, 1993). 

Another example of this kind of approach is given by the work of (Euliano & 
Principe, 1996). They add a spatio-temporal coupling to Kohonen maps so as to 
create temporally and spatially localized neighbourhoods. The spatio-temporal 
coupling is based on travelling waves of activity which attenuate over time. 
When these travelling waves reinforce one another, temporal activity wavefronts 
are created which are then used to enhance the possibility of a given neuron 
being active in the next cycle.

Finally, more mathematically grounded approaches were developed by
(Kopecz, 1995), (Mozayyani et al., 1995) or (Chappelier & Grumbach, 1996).

(Kopecz, 1995) creates a Kohonen map with a lateral coupling structure 
which has symmetric and antisymmetric coupling (for temporal ordering). Once
trained, the antisymmetric weights allow active regions of the map to trigger 
other regions in the map, thus reproducing the trained temporal pattern. 

(Mozayyani et al., 1995) use a coding with complex numbers where the time
dimension is embedded into the phase of the complex representation. 

(Chappelier & Grumbach, 1996) embed the map into a high dimensional 
space, classifying temporal inputs as functions of time (i.e. map inputs are no 
longer 2 or 3-D vectors but higher dimension vectors, each vector representing a 
function of time). 



3.4 Networks of Spiking Neurons 

Synfire Chains. Synfire chains were proposed by (Abeles, 1982) as a model of 
cortical function. A synfire chain consists of small layers of neurons connected 
together in a feedforward chain so that a wave of activity propagates from layer 
to layer in the chain. Interest in them has grown because they provide a possible 
explanation for otherwise mysterious measurements of precise neural activity 
(so-called “spike”) timings. Many spatio-temporal patterns can be stored in a 
network where each neuron participates in several chains (chains are different 
but have non-empty intersections) although this introduces crosstalk noise which 
ultimately limits the network capacity. This capacity is however non-zero, allowing
the effective use of such a model. It should furthermore be noted that since
only a small fraction of neurons are active at a given time, many synfire chains 
can be simultaneously active, providing a possible mechanism for a higher level 
of organization. 



Note: in the description of the work of (Euliano & Principe, 1996) in section 3.3,
a neuron “being active” technically means a neuron “winning”.

RST. The approach of (Chappelier & Grumbach, 1998) consists of embedding
spatial dimensions into the network and integrating time at both neuron and
connection levels. The aim is to take both spatial relationships (e.g. as between
neighbouring pixels in an image) and temporal relationships (e.g. as between
consecutive images in a video sequence) into account at the architecture level.

Concerning the spatial aspect, the network is embedded into the actual space 
(2 or 3-D), the metrics of which directly influence its structure through a connec- 
tion distribution function. A given number of neurons is randomly distributed 
in a portion of the space delimited by two planes, the input and output layers. 
Links are then created between the neurons according to their neighbourhood 
in the embedding space. 

For the temporal aspect, they used a leaky-integrator neuron model with
a refractory period and post-synaptic potentials. The implemented model is
mainly described by two variables: a membrane potential $V$ and a threshold $\theta$. $V$
is the sum of a specific potential $U$ and an external potential $I$ which stands for
the input of the neuron. Most of the time $V$ is less than $\theta$. Whenever $V$ reaches
$\theta$, the neuron “fires”: it sends a spike to the downstream neurons and changes
its state as follows: $\theta$ is increased by some amount called adaptation or fatigue
and the specific potential $U$ is lowered to a post-spike value. When the
neuron does not fire, the variables $U$ and $\theta$ decay exponentially to their resting
values.
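
The firing rule just described can be sketched in discrete time as follows. This is an illustrative simplification and not the implementation of (Chappelier & Grumbach, 1998): the function name `rst_like_neuron`, all default parameter values and the single shared `decay` factor are assumptions.

```python
def rst_like_neuron(inputs, theta_rest=1.0, u_rest=0.0, u_post=-1.0,
                    adaptation=0.5, decay=0.5):
    """Return the time steps at which the neuron fires.

    V = U + I is compared with the threshold theta. On a spike, theta is
    raised by `adaptation` and U drops to `u_post`; otherwise U and theta
    relax exponentially towards their resting values.
    """
    u, theta = u_rest, theta_rest
    spikes = []
    for t, i_ext in enumerate(inputs):
        v = u + i_ext                      # membrane potential V = U + I
        if v >= theta:
            spikes.append(t)
            theta += adaptation            # fatigue raises the threshold
            u = u_post                     # specific potential drops
        else:
            u = u_rest + decay * (u - u_rest)
            theta = theta_rest + decay * (theta - theta_rest)
    return spikes


# A constant supra-threshold input does not produce a spike at every step:
# the lowered U and raised theta act as a refractory period.
print(rst_like_neuron([1.5] * 4))  # [0, 3]
```

The gap between the two spikes comes entirely from the post-spike state change, which is how this kind of model obtains a refractory period without an explicit timer.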

The input of a given neuron is the sum over space (all the input neurons of 
the considered neuron) and time (all the firing instants of its input neurons). In 
order to provide temporal robustness, the spikes sent by inputs are received as
post-synaptic potentials described by a function of the kind $\exp(1 - t/\tau)$.

The propagation of neuron spikes in the network as spatiotemporal synchronized
waves enables RST to perform time and space correlation detection, e.g.
motion detection in a video sequence (see figure 6). Spike synchronization plays
the main role in RST for filtering static input patterns from moving ones.



Note: for more details on “integrate and fire” neurons, we refer to the paper of
(Gerstner, 1995) or the book of (MacGregor & Lewis, 1977).

Fig. 6. Application of the RST network to motion detection in video sequences. The
response of the network (i.e. spiking neurons) is superimposed as white squares on the
original input images.



4 Applications

A major reason for the interest in connectionist models of intelligent processes is
that they have been successfully applied to an impressive number of different
application domains (e.g. see the book of (Fogelman-Soulie & Gallinari, 1998)).
In many applications these models are required to deal with time and exhibit
different dynamic behaviour.



4.1 Speech Processing

Most problems in automatic speech recognition are very difficult to address
using traditional pattern recognition approaches designed for static data types.
Speech has an inherent dynamic nature and, therefore, an effective model needs
to be able to capture important temporal dependencies. The use of temporal
connectionist models for speech processing is motivated by a number of different
reasons:

— Speech recognition and speech understanding, due to the huge amount of 
variability in the signal, require high learning capabilities; 

— large-scale speech processing projects (e.g. ARPA) have demonstrated the 
importance of high performance at the phonetic level; 

— the statistical hypotheses of the best current models (hidden Markov models) 
are quite restrictive; 

— learning and prior knowledge can be framed homogeneously in connectionist 
models; 

— connectionist models are an intrinsically parallel computational scheme which 
turns out to be useful for very demanding applications. 

Connectionist models have been proposed for both phoneme recognition and
isolated word recognition with interesting results (e.g. see the book of (Gori, 1992)).



Phoneme Recognition. The first problem consists of coding a given speech 
utterance by means of the corresponding phonemes. The speech signal is typically 
pre-processed so as to produce a sequence of frames, each composed of a vector 
of discriminative features (e.g. spectral parameters). In practice, the frames are 
produced at a rate which is related to the speed of the commands that the
brain uses to control the articulatory system. A possible approach to predicting 
phonemes is to simply rely on a fixed speech window composed of a predefined 
number of frames. Unfortunately, the information required to predict different 
phonemes is spread over a significantly varying number of frames (Bourlard & 
Morgan, 1994, 1998). RNNS are much better suited for dealing with such a 
problem. The basic problem of choosing a suitable speech window is in fact 
overcome by the inherent dynamical nature of the model. The input can simply 
be taken at frame level and the network is expected to capture the temporal 
dependencies which turn out to be useful for an effective phoneme classification. 
A possible recurrent architecture for this problem can consist of a simple one- 
layer network which takes a single speech frame as input and in which only self- 
loop connections are adopted. The speech signal is processed frame by frame 
along time and for each speech frame the neural network outputs a prediction 
of the corresponding phoneme. It has been pointed out that this architecture 
turns out to be suitable to incorporate the forgetting behaviour that a phoneme 
classifier is expected to exhibit (Bengio et ah, 1992). Basically, the phoneme 
classification is supposed to depend on the speech frame being processed and on 
the close frames, but it is supposed not to depend on remote information. Very 
successful results for the problem of phoneme recognition on the DARPA-TIMIT 
speech data base have been found by (Robinson, 1994), where, in addition to 
the adoption of recurrent architectures, proper integration schemes with hidden 
Markov models are proposed. 
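
As a rough illustration of the one-layer, self-loop architecture described
above, the following Python sketch processes a sequence frame by frame and
emits one phoneme index per frame. It is not the model of (Bengio et al., 1992)
or (Robinson, 1994): the weights are random rather than trained, and the names
make_layer and run, the tanh units, and the bound on the self-loop weights are
assumptions made here for illustration.

```python
import math
import random


def make_layer(n_features, n_phonemes, seed=0):
    """Random (untrained) parameters for a one-layer net with self-loops only."""
    rng = random.Random(seed)
    return {
        # Input weights: one row per output unit (one unit per phoneme class).
        "w_in": [[rng.uniform(-0.5, 0.5) for _ in range(n_features)]
                 for _ in range(n_phonemes)],
        # One self-loop weight per unit; magnitude below 1 so old frames fade.
        "w_self": [rng.uniform(0.0, 0.9) for _ in range(n_phonemes)],
    }


def run(layer, frames):
    """Process frames one at a time; return (label per frame, final state)."""
    state = [0.0] * len(layer["w_self"])
    labels = []
    for frame in frames:
        # Each unit sees the current frame and only its own previous value.
        state = [
            math.tanh(w_self * x + sum(w * f for w, f in zip(w_row, frame)))
            for w_row, w_self, x in zip(layer["w_in"], layer["w_self"], state)
        ]
        # The most active unit is taken as the predicted phoneme.
        labels.append(max(range(len(state)), key=lambda i: state[i]))
    return labels, state
```

Because every self-loop weight has magnitude below 1 and tanh never expands
differences, two runs that differ only in their first frame end in almost the
same state after a long common tail, which is the forgetting behaviour mentioned
in the text.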



Isolated Word Recognition. In principle, RNNs can also be used for isolated
word recognition. Isolated words are in fact sequences of speech frames that
can be labelled at the end by a target specifying the sequence
membership. Unfortunately, in spite of the experimental efforts of many research
groups, this direct approach has not produced very successful results yet. There
are at least two major reasons for this lack of experimental success. First, isolated
words are sequences composed of hundreds of frames and it is now well known
that long-term dependencies are difficult to capture by a gradient-based learning
algorithm (Bengio et al., 1994). Hence, RNN-based classifiers tend
to perform the prediction on the basis of the last part of the word, which is
a strong limitation, especially when dealing with large dictionaries. Second,
regardless of the architecture and weights, RNNs are difficult to use when the
number of classes becomes very large. Basically, RNNs inherit from MLPs both
their very strong discrimination capabilities and their poor
scaling with respect to the number of classes.

4.2 Language Processing 

Theoretical foundations of natural language have been swinging, pendulum-like,
between fully symbolic models and approaches more or less based on
statistics. Since the renewal of interest in neural networks, the connectionist approach
has been immediately recognized as a neat way of dealing with the inherent
uncertainty of natural languages. Languages, however, also have an inherently
sequential nature and, therefore, only connectionist models that incorporate time
are good candidates for language processing. Typical tasks in language
processing involve time as an external variable acting at different levels. For instance,
lexical analyses require processing letters of a given alphabet in a sequential
way, whereas in syntactical analyses, time is used to scan the different words
composing a sentence. In the last few years, RNNs have appeared to be very well suited
for performing interesting language processing tasks.

Footnote (on the frame rate, in Phoneme Recognition): in the order of magnitude
of 10 ms.



Prediction of Linguistic Elements. Let us consider the problem of predicting
linguistic elements. A preliminary investigation was carried out by
(Servan-Schreiber et al., 1991) concerning the prediction of terminal items for Reber’s
grammar (Reber, 1976) (see figure 7).







Fig. 7. An automaton representation of Reber’s finite-state grammar. 



Elman’s recurrent network was used for the experiments. The network had
an input layer coding the symbols and some context units representing the
state. The output layer had the same number of units as the input layer and
one-hot coding was adopted for the symbols of both layers. At each time step,
a symbol from the alphabet of a Reber string was provided at the input and the
recurrent network was asked to predict the next one. The training was carried
out with a sample of Reber strings and the network subsequently exhibited
very good generalization capabilities. Interestingly enough, the network
developed automata-like internal representations and was even capable of performing
correctly on arbitrarily long sequences.

Footnote (on models that incorporate time, Section 4.2): at least as an order
relation.
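
The next-symbol prediction task on Reber strings can be sketched as follows.
The transition table below is the commonly cited form of Reber's finite-state
grammar and is an assumption here; it should be checked against figure 7. The
helper names (generate_reber, is_reber, one_hot, prediction_pairs) are
hypothetical, and the sketch only builds the one-hot training pairs; it does
not train a network.

```python
import random

# Assumed transition table: state -> [(emitted symbol, next state), ...].
# None marks the end of the string.
REBER = {
    0: [("B", 1)],
    1: [("T", 2), ("P", 3)],
    2: [("S", 2), ("X", 4)],
    3: [("T", 3), ("V", 5)],
    4: [("X", 3), ("S", 6)],
    5: [("P", 4), ("V", 6)],
    6: [("E", None)],
}
ALPHABET = "BTSXPVE"


def generate_reber(rng):
    """Random walk through the automaton, collecting the emitted symbols."""
    state, symbols = 0, []
    while state is not None:
        symbol, state = rng.choice(REBER[state])
        symbols.append(symbol)
    return "".join(symbols)


def is_reber(string):
    """True if the automaton consumes the whole string and then stops."""
    state = 0
    for symbol in string:
        if state is None:
            return False
        moves = dict(REBER[state])
        if symbol not in moves:
            return False
        state = moves[symbol]
    return state is None


def one_hot(symbol):
    """Exclusive coding: exactly one component is 1."""
    return [1 if s == symbol else 0 for s in ALPHABET]


def prediction_pairs(string):
    """(input at time t, target = symbol at time t + 1), both one-hot."""
    return [(one_hot(a), one_hot(b)) for a, b in zip(string, string[1:])]
```

Note that the target is not unique: after most symbols two successors are
legal, so a trained predictor is usually judged on whether it activates the
legal successors rather than on a single correct output.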

Footnote (on one-hot coding): an exclusive coding in which one and only one
output neuron is high.

Early experiments reported by (Elman, 1990) were aimed at predicting
the next word given the initial part of a short sentence. The lexical items (inputs
and outputs) were presented in a localist form using basis vectors. Hence, lexical
items were orthogonal to one another and there was no encoding of the item’s
category membership.

Other interesting language processing tasks were presented by (Elman, 1991).
His simple RNN was trained to learn the correctness of a given sentence on the
basis of the presentation of positive and negative examples. For instance,
Elman studied the problem of detecting the agreement of nouns with their verbs.
Thus, for example, John feeds dogs and Girls sees Mary are grammatical
and ungrammatical, respectively. No information concerning the grammatical
role (subject/object, etc.) is provided to the network. The grammar of the
language used in the experiment is given in Table 2.



S     → NP VP "."
NP    → PropN | N | N RC
VP    → V (NP)
RC    → who NP VP | who VP (NP)
N     → boy | girl | cat | dog | boys | girls | cats | dogs
PropN → John | Mary
V     → chase | feed | see | hear | walk | live |
        chases | feeds | sees | hears | walks | lives

Additional restrictions:

- number agreement between N and V within clause, and (where appropriate)
  between head N and subordinate V.

- verb arguments:
    chase, feed: require a direct object
    see, hear: optionally allow a direct object
    walk, live: preclude a direct object
  (observed also for head/verb relations in relative clauses)



Table 2. The grammar used by (Elman, 1990) for different language tasks, like noun- 
verb agreement, verb argument structure, and interactions with relative clauses. 



The network is expected to learn that there are items which function as what 
we would call nouns, verbs, etc. and then must learn which items are examples of 
singular or plural, and which nouns are subjects and objects. Related successful 
experiments have been carried out concerning the verb argument structure, and 
the interactions with relative clauses. 



Grammatical Inference. The theoretical result stating that an RNN can
behave as an automaton is fully illustrated by some applications of RNNs to
natural language processing. Unlike automata, however, the neural activations are
continuous-valued variables and, therefore, understanding the internal
representation that the network develops during training is nontrivial. RNNs are
basically adaptive parsers whose behaviour depends upon the parameters
developed during the training. Formally, an adaptive neural parser can be
regarded as a 4-tuple $\{U, X, \Phi, Z\}$, where $U \subset \mathbb{R}^m$ is the
alphabet of symbols, $X \in \mathbb{R}^n$ is the state,
$\Phi(W) : \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^n$ is the state
transition function, which depends on a vector of parameters
$W \in \mathbb{R}^k$, and $Z : \mathbb{R}^n \to \{0, 1\}$ is the
decision function that decides whether a given state is accepted or not. Given
a set of labelled examples, one could try to relate the learning of the network
and the developed internal representation to the grammar which generates
the language. In the literature, the grammatical inference of the hidden rule is
stated as the search for a parser capable of classifying the strings. The inference
process adapts the neural parser to the given learning set by means of a search
in the parameter space that defines the state transition function $\Phi(W)$.

In order to process symbolic strings by neural networks, each symbol of the
input alphabet $\Sigma$ has to be encoded. Basically, $\Sigma$ is mapped to a
set of vectors $U = \{U_1, \ldots, U_s\}$ ($U_k \in \mathbb{R}^m$) and each
string of $\Sigma^*$ corresponds to a sequence of vectors that is used as input
to the RNN. The classification of each string is decided by looking at the
output of neuron N at the end of the input sequence.

The training set is composed of a set of L pairs $(s, d)$, where $s \in U^*$ and
$d \in \{d^+, d^-\}$, with $d^+, d^- \in \mathbb{R}$. The learning algorithm
adapts the neural network parameters by using back-propagation through time
(e.g. see the paper of (Williams & Peng, 1990)).
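
A minimal sketch of such a neural parser is given below, under stated
assumptions: one-hot codes for the symbols, first-order sigmoid units for the
state transition function, and acceptance decided by thresholding the last
state unit. None of these choices is prescribed by the text, the parameters are
random rather than trained, and the names make_parser, phi, decide and classify
are hypothetical.

```python
import math
import random


def make_parser(alphabet, n_state, seed=0):
    """Random (untrained) parameters W plus a one-hot code for each symbol."""
    rng = random.Random(seed)
    m = len(alphabet)
    return {
        "codes": {s: [1.0 if i == k else 0.0 for i in range(m)]
                  for k, s in enumerate(alphabet)},
        "W_x": [[rng.uniform(-1, 1) for _ in range(n_state)]
                for _ in range(n_state)],
        "W_u": [[rng.uniform(-1, 1) for _ in range(m)]
                for _ in range(n_state)],
        "x0": [0.0] * n_state,
    }


def phi(parser, x, u):
    """State transition: maps (state in R^n, input in R^m) to a new state."""
    new_state = []
    for row_x, row_u in zip(parser["W_x"], parser["W_u"]):
        a = sum(w * xi for w, xi in zip(row_x, x))
        a += sum(w * ui for w, ui in zip(row_u, u))
        new_state.append(1.0 / (1.0 + math.exp(-a)))
    return new_state


def decide(x, threshold=0.5):
    """Decision function Z: accept (1) iff the designated unit is high."""
    return 1 if x[-1] > threshold else 0


def classify(parser, string):
    """Feed the encoded string symbol by symbol, decide on the final state."""
    x = parser["x0"]
    for symbol in string:
        x = phi(parser, x, parser["codes"][symbol])
    return decide(x)
```

Training would adjust W_x and W_u so that classify agrees with the labels d of
the training pairs; that step is omitted here.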

When processing symbolic strings by RNNs, the state vector X describes 
complex trajectories. As proposed by (Kolen, 1994), these trajectories can be 
studied in the framework of Iterated Function Systems (IFSs). 

Basically, the symbolic interpretation emerges from partitioning the state
space into a set of regions that are associated with the states of a finite machine.
The volume of these regions defines the resolution of the extraction process.

The number of such regions provides interesting information concerning the
rule extraction process. The more regions are used to approximate the network
trajectories, the more detailed the description and, consequently, the more
likely the extracted machine is to have a large number of states (see figure 8).
Equivalent states can of course be removed by using a state minimization
algorithm for finite state machines. There is experimental evidence that beyond
a certain number of clusters, the extraction of the finite machine followed by
state minimization does not produce an increasing number of states but,
instead, a maximum value is reached.
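
The partition-and-extract idea can be sketched as follows. The sketch assumes
that the state space is the unit hypercube and that the regions are the cells
of a uniform grid with q intervals per dimension; this is only one possible
choice of regions. The toy parity "network" (parity_step, parity_accept) is a
hand-built stand-in for a trained parser, and state minimization is not
included.

```python
import math
from collections import deque


def extract_fsa(step, accept, x0, alphabet, q):
    """Breadth-first FSA extraction over a uniform grid on [0, 1]^n.

    step(x, a) returns the next state vector, accept(x) returns a bool.
    Each visited cell becomes one FSA state, represented by the first
    state vector that fell into it.
    """
    def cell(x):
        return tuple(min(int(v * q), q - 1) for v in x)

    start = cell(x0)
    representative = {start: x0}
    transitions = {}
    queue = deque([start])
    while queue:
        c = queue.popleft()
        for a in alphabet:
            x_next = step(representative[c], a)
            c_next = cell(x_next)
            if c_next not in representative:
                representative[c_next] = x_next
                queue.append(c_next)
            transitions[(c, a)] = c_next
    accepting = {c for c, x in representative.items() if accept(x)}
    return start, transitions, accepting


def run_fsa(start, transitions, accepting, string):
    """Run the extracted machine on a string of symbols."""
    c = start
    for a in string:
        c = transitions[(c, a)]
    return c in accepting


def parity_step(x, a):
    """Toy one-unit second-order network tracking the parity of '1' symbols."""
    u = 1.0 if a == "1" else 0.0
    z = 10.0 * (x[0] + u - 2.0 * x[0] * u - 0.5)
    return [1.0 / (1.0 + math.exp(-z))]


def parity_accept(x):
    return x[0] > 0.5
```

For this toy network, grids with q = 2 and q = 8 both yield the same two-state
parity automaton, a small instance of the saturation effect described above.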

These basic steps for performing grammatical inference in the case of finite 
state machines are summarized in figure 9. 

Most problems of language processing have been tackled by using Elman’s
recurrent network. However, second-order RNNs are more suitable for extracting
the internal representation and, consequently, for grammatical inference (e.g. see
the paper of (Miller & Giles, 1993) or the one of (Omlin & Giles, 1996)).

Footnote (on strings used as input to the RNN): the notation $\Sigma^*$ denotes
the set of all the possible sequences created using the elements contained in
$\Sigma$.

Fig. 8. Finite State Automaton (FSA) extraction algorithm using a neural parser with
a two-dimensional state space. (a) The number of regions is 7. The extracted FSA
reproduces exactly the network behaviour on all the strings with length up to 6.
The transition rules from state 1 are shown. (b) The number of regions is 77. This
number of clusters is necessary in order to have the same behaviour as the network
on all the strings with length up to 11. The corresponding minimization yields an
equivalent 31-state machine.

Fig. 9. Grammatical inference using neural networks: the learned configuration is
subsequently used for the extraction of symbolic rules. [Figure label: State
space analysis.]



Parsing with Simple Synchrony Networks. Simple Synchrony Networks
(SSNs) (Lane & Henderson, 1998, 2000) are an extension of RNNs, adding
another usage of internal time in order to represent structural constituents. This
extension does not, however, change the way external time is dealt with.

More precisely, SSNs can be seen as RNNs of pulsing units, which enables
them to represent structures and to generalize across structural constituents.
The SSN approach consists in representing structural constituents directly,
rather than using the usual indirect encoding of RNNs. The structural relationships are
represented by synchrony of neuron activation pulses, leading to an incremental
representation of the structure over the constituents. Indeed, SSNs extend
the incremental outputs of RNNs with as many output neurons as required by
Temporal Synchrony Variable Binding (Shastri & Ajjanagadde, 1993). The
central idea is to divide each time period into several phases, each phase being
associated with a unique constituent. For an input sentence of n words, the
representation of the syntactic structure in the output is achieved by the unfolding
of that structure in a temporal sequence of n phases in which unit synchrony
represents some relationship between constituents in the structure (for instance
the father-son relationship).

SSNs have been successfully applied to standard natural language parsing
problems: taking English sentences drawn from a corpus of naturally occurring
text, the model incrementally outputs a hierarchical structure representing how
the words fit together to form constituents (i.e. a parse tree of the input
sentence).



5 Extension: From Temporal to Structured Data Types 

The learning of sequential information is the first step toward the adaptive com- 
putation of dynamic data types. RNNs, as presented in section 3.2, were con- 
ceived so as to exhibit a dynamic behaviour for incorporating time, i.e. dealing 
with temporal sequences. 

From the structure point of view, any discrete sequence of real-valued varia- 
bles defined over a time interval can be regarded as a list, which is in fact the 
simplest conceivable dynamic data type. RNNs can therefore be seen as good 
candidates for list processing and even more structured data types. 

Early research in this direction was carried out by (Pollack, 1990) who intro- 
duced the RAAM model, which is capable of dealing with trees with labels in 
the leaves. 



Footnote (on structural constituents): non-terminals in the case of Natural
Language parsing.
Footnote (on external time): word sequence in the case of Natural Language
sentences.

Processing Lists. If we represent an input signal sequence $U$ of an RNN by
a list in which each node, indexed by $v$, contains a real-valued vector $U_v$, the
general computational scheme described by equations 1 (page 114), and aimed
at producing the output list $Y$, can be written:

$$
\begin{aligned}
X_v &= f(q^{-1} X_v, U_v, v, \theta_f) \\
Y_v &= g(X_v, v, \theta_g),
\end{aligned}
\qquad (4)
$$

where $q^{-1}$ is the operator that, when applied to state $X_v$, returns the
state $X_w$ of the next node $w$ of the list.

The model defined by equation 4 can itself be structured in the sense that
the generic variable $X_{i,v}$ might be independent of $q^{-1} X_{j,v}$. Likewise,
other statements of independence might involve input-state variables and/or
state-output variables. An explicit statement of independence is a sort of prior
knowledge on the mapping that the machine is expected to learn. In general
these statements can also be different for different nodes and can be
conveniently expressed by a graphical structure that is referred to as a
recursive network.

Lists can in fact be processed by means of an encoding network which is 
constructed by unfolding the input through the list. The corresponding network 
is created by associating each node of the list with input, state, and output 
variables, respectively. The state variables are connected graphically following 
the reverse direction of the list traversal, and the input and the output variables 
are connected with the associated state variables. 
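
The unfolding of the scheme of equation 4 along a list can be sketched as
follows. Since the "next node" of the list corresponds to the former time step,
the head of the list is taken here to be the most recent element, and the state
is computed from the tail towards the head. The class Node and the helpers
from_sequence and encode_list are hypothetical names, and f and g are passed in
as plain functions of (state of the next node, label) and (state),
respectively, which drops the explicit dependence on the node index and on the
parameters.

```python
class Node:
    """One list node: a label (the input U_v) and a link to the next node."""

    def __init__(self, label, next=None):
        self.label = label
        self.next = next


def from_sequence(sequence):
    """Build a list whose head is the LAST element of the temporal sequence,
    so that following `next` moves one time step back."""
    node = None
    for u in sequence:
        node = Node(u, node)
    return node


def encode_list(node, f, g, x0):
    """Return (state of `node`, outputs for all nodes, head first).

    x0 is the frontier state used beyond the end of the list.
    """
    if node is None:
        return x0, []
    x_next, outputs = encode_list(node.next, f, g, x0)
    x = f(x_next, node.label)          # X_v depends on the next node's state
    return x, [g(x)] + outputs         # Y_v depends on X_v only
```

For example, with f(x, u) = 0.5 x + u, g the identity, and the temporal sequence
1, 2, 3, the states computed from the tail to the head are 1, 2.5 and 4.25,
exactly as a step-by-step recurrent update would give.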



Generalization to Directed Ordered Acyclic Graphs. A nice extension
of time sequences can be gained in the framework of dynamic data structures.
Basically, giving a list amounts to giving a set of tokens on which an order
relation is defined. When paying attention to “temporal relations”, a directed
graph seems to be the most natural extension of a list, in the sense that any
directed graph is a way of defining a partial order over a set of homogeneous
tokens. In particular, let us consider a directed ordered acyclic graph (DOAG) so
that for any node $v$ one can identify a set, potentially empty, of ordered children
$ch[v]$. For each node, one can extend the next-state equation (4) as follows:

$$
\begin{aligned}
X_v &= f(X_{ch[v]}, U_v, v, \theta_f) \\
Y_v &= g(X_v, v, \theta_g).
\end{aligned}
\qquad (5)
$$

In the case of binary trees, the state associated with each node is calculated
as a function of the attached label and of the states associated with the left
and right children, respectively. The operators $q_L^{-1}$ and $q_R^{-1}$ make it
possible to address the information associated with the left and right children
of a given node and, therefore, straightforwardly generalize the temporal delay
operator $q^{-1}$. Of course, for any node, the children must be ordered so as
to be able to produce different outputs for binary trees {r,L,R} and {r,R,L}.
A list is just a special case of a binary tree in which one of the children is
null.

Footnote (on the operator $q^{-1}$): the operator is introduced here in order to
make the extension to more complex data structures easier. In the context of
lists representing temporal sequences, it corresponds to the former time step,
that is $q^{-1} X_t = X_{t-1}$.

The construction holds for any DOAG provided that a special node s of the 
graph, referred to as the supersource, is given together with the graph. 

The computation of $Y_v$ in the case of graphs is also more involved than the
one associated with the simple recursive model of equation (4). The DOAG
somehow represents the state update scheme, i.e. a graphical representation of
the computation taking place in the recursive neural network (see figure 10).
This graph plays its own role in the computation process, both because of the
information attached to its nodes and because of its topology. This graphical
representation of the state update emphasizes the structure of independence of
some variables in the state-based model of equation 5. For instance, a classic
structure of independence arises when the connections of any two state
variables $X_v$ and $X_w$ only take place between components $X_{i,v}$ and
$X_{i,w}$ with the same index $i$. In the case of lists,
this assumption means that only local-feedback connections are permitted for the
state variables. Basically, the knowledge of a recursive network yields topological
constraints which often make it possible to cut the number of learning parameters
significantly.



Fig. 10. Construction of the encoding network corresponding to a recursive network
and a directed ordered acyclic graph. Proper frontier (initial) states are represented by
squares. The encoding network inherits the structure of the input graph. When making
the functional dependence explicit, the encoding network becomes a neural network,
which is used to calculate the corresponding output.



Furthermore, the information attached to the recursive network needs to be
integrated with a specific choice of functions $f$ and $g$ which must be suitable for
learning the parameters. A connectionist assumption for functions $f$ and $g$ turns
out to be adequate, especially to fulfil computational complexity requirements.
The first-order recursive neural network is one of the simplest architectural
choices. In this case, equation 5 becomes:

$$
\begin{aligned}
X_v &= \sigma(A_v \cdot q^{-1} X_v + B_v \cdot U_v) \\
Y_v &= \sigma(C_v \cdot X_v).
\end{aligned}
\qquad (6)
$$

Matrix $A_v \in \mathbb{R}^{n,n}$ contains the weights associated with the
feedback connections, whereas matrix $B_v \in \mathbb{R}^{n,m}$ contains the
weights associated with the input-neuron connections. Finally,
$C_v \in \mathbb{R}^{p,n}$ is the parameter for the definition of the
state-output map. These equations produce the next-state and the output
values by relying on a first-order equation, in which the outputs are bounded by
using a squashing function. An in-depth analysis of these models can be found
in (Frasconi, Gori, & Sperduti, 1998; Frasconi, 1998).
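
For binary trees, a first-order recursive network of this kind can be sketched
as follows. The sketch makes assumptions that go beyond the text: it uses one
feedback matrix per child position (left and right), shares all weights across
nodes, uses the logistic function as squashing function, and takes a fixed
frontier state x0 for missing children. The names Tree, make_params, state and
output are hypothetical.

```python
import math
import random


class Tree:
    """Binary tree node with a real-valued label vector and ordered children."""

    def __init__(self, label, left=None, right=None):
        self.label, self.left, self.right = label, left, right


def sigma(vector):
    """Component-wise logistic squashing function."""
    return [1.0 / (1.0 + math.exp(-a)) for a in vector]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def make_params(n, m, p_out, seed=0):
    """Random parameters: A (one n x n matrix per child), B (n x m), C (p x n)."""
    rng = random.Random(seed)

    def rand(rows, cols):
        return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

    return {"A": [rand(n, n), rand(n, n)], "B": rand(n, m),
            "C": rand(p_out, n), "x0": [0.0] * n}


def state(tree, params):
    """State of a node from its label and its children's states (bottom-up)."""
    if tree is None:
        return params["x0"]            # frontier state for a missing child
    a_left, a_right = params["A"]
    from_left = matvec(a_left, state(tree.left, params))
    from_right = matvec(a_right, state(tree.right, params))
    from_input = matvec(params["B"], tree.label)
    return sigma([l + r + b
                  for l, r, b in zip(from_left, from_right, from_input)])


def output(tree, params):
    """Output computed at the root from its state."""
    return sigma(matvec(params["C"], state(tree, params)))
```

Because the two children go through different matrices, swapping the left and
right subtrees generally changes the output, which is why the children must be
ordered.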

Concerning the learning, since the structure of the encoding neural network
is inherited from both the DOAG and the recursive network, the encoding
neural network is essentially a feedforward network. Hence the back-propagation
algorithm for feedforward neural networks can be conveniently extended to data
structures. The learning algorithm is in this case referred to as back-propagation
through structure (Sperduti, 1998).



6 Conclusion 

We gave in this chapter an overview of what is presently going on in the field of 
temporal connectionist models. The aim was not to be as exhaustive as possible 
but to exhibit the main concepts, ideas and applications in this area. 

Temporal connectionist models already have the power to do arbitrary
computations with time-varying data, with the advantage of learning from examples.
Fundamental theorems about their potential capabilities already exist. However,
efficient ways of using them in practice still need to be found: existing training
methods still suffer from their inability to deal with very long time dependencies.
Although the success so far in capturing and classifying temporal information
with neural networks is still limited, the approach looks very promising and
is growing rapidly.

Footnote (on equation 6): $\sigma(\cdot)$ denotes a vector of squashing functions
operating on $n$ neurons.

1 Introduction

The complexity and variety of learning tasks, as well as the number of
different neural network models for sequence learning, are quite high. Moreover,
in addition to architectural details and the peculiarities of training algorithms, there
are other relevant factors which add complexity to the management of a neural
network for the adaptive processing of sequences. For example, training
heuristics, such as adaptive learning rates, regularization, and pruning, are very
important, as is the insertion of a priori domain knowledge. All these issues
must be considered and matched with the complexity of the application domain
at hand. This means that the successful application of a neural network to a
real-world domain requires answering several questions on the type of architecture,
training algorithms, training heuristics, and knowledge insertion, according to
the problem complexity.

At present, these questions cannot be easily answered, due to the lack of a
computational tool encompassing all the relevant issues. We observe that some
authors [45] [67] [70] [12] [19] [36] [20] [24] tried to unify different
architectures and learning algorithms. However, none of their proposals is complete. The
same situation is encountered when considering software simulators and neural
specification languages: all of them are restricted to specific models and do not
allow the user to develop new models.

On the basis of these observations, we argue for the need for a Neural Abstract
Machine, i.e., a formal and precise definition of the basic (and relevant) objects
and operations which are, respectively, manipulated and performed by
neural computation.

To define the Neural Abstract Machine, we suggest using Abstract State
Machines (ASMs). ASMs have been extensively used for the formal design and
analysis of various hardware systems, algorithms, and programming language
semantics. They allow formal systems to be specified in a very simple way, while
preserving mathematical soundness and completeness.

In Section 2 we briefly review the different issues arising in
sequence learning. On the basis of this review, in Section 3, we argue for
the need for a Neural Abstract Machine. Abstract State Machines are briefly
presented in Section 4 and a very simple example of how they can be applied
to neural networks is discussed in Sections 5 and 6. Conclusions are drawn in
Section 7.



R. Sun and C.L. Giles (Eds.): Sequence Learning, LNAI 1828, pp. 135—161, 2000. 
© Springer- Verlag Berlin Heidelberg 2000 







D. Sona and A. Sperduti 



2 A Brief Overview on Sequence Learning by Neural 
Networks 

In the following, we briefly outline the main issues in sequence learning by
neural networks. We point out the large number of different neural architectures
and learning algorithms, discuss the main known results on the computational
power of architectures and on the complexity of learning, and briefly report the
most important issues concerning training heuristics and knowledge insertion in
recurrent networks. Moreover, we discuss the difficulties
in developing successful neural solutions for application problems.



2.1 Domains, Tasks, and Approaches to Sequence Learning 

Informally, a sequence is a serially ordered set of atomic entities. Sequences (the
simplest kind of dynamic data structure) typically occur in learning domains
with temporal structure, where each atom corresponds to a discrete time point.
For example, variables in a financial forecasting problem are sampled at
successive instants, yielding an instance space formed by discrete-time sequences of
observations. Automatic speech recognition systems contain front-end acoustic
modules that learn to translate sequences of acoustic attributes into sequences
of phonetic symbols. Other examples of temporal data can be found in problems
of automatic control or digital signal processing. In all these cases, data
gathering involves a digital sampling process. Hence, serial order in sequences is
immediately associated with the common physical meaning of time. There are,
however, other kinds of data that can be conveniently represented as
sequences. For example, consider a string of symbols obtained after preprocessing in
syntactic approaches to pattern recognition. Also consider problems of molecular
biology, in which DNA chains are represented as strings of symbols associated
with protein components. Such strings can also be effectively represented as
sequences, although time in these cases does not play any role in a strictly physical
sense. Time, in the domain of sequences, therefore has the more abstract
meaning of a coordinate used to address simple entities which are serially ordered
to form a more complex structure. More complex situations arise when
considering sequences combining both symbolic and numerical data, as may happen
in medical applications.

There are different learning tasks involving sequences. Typical tasks are clas- 
sification, time series prediction (with different order of prediction), sequential 
transduction, and control. Sometimes it is also useful to try to approximate pro- 
bability distributions over sequence domains, as well as to discover meaningful 
clusters of sequences. This is particularly useful when performing Knowledge 
Discovery and Data Mining. 

The complexity and variety of problems in sequence learning are so high that
it would be naive to think that a single approach suffices to master the field.
According to the nature of data and learning tasks, different approaches have
been defined and explored, such as Recurrent Neural Networks, Hidden Markov
Models, Reinforcement Learning (dynamic programming), Evolutionary
Computation, Rule-based Systems, Fuzzy Systems, and so on. Recently, the feeling
has been emerging in the scientific community that a combination of several
approaches is needed to face real-world problems. Unfortunately, foundations on how to
rigorously proceed with this combination have yet to emerge.

2.2 Representations and RNN Architectures 

Neural network architectures for sequence learning can be broadly classified into
two classes, according to the way time is represented: explicitly or implicitly.

Explicit time representation is also referred to as algebraic representation
of time, since input and output events at different time steps are explicitly
represented as unrelated variables on which an algebraic model operates, i.e., the
whole input subsequence from time 1 to time t is mapped into the output using
a static relationship. From a practical point of view, this means that a buffer
holding the external inputs to the system, as they are received, must be used.
Basically, with an explicit representation of time, temporal processing problems
are converted into spatial processing problems, thus allowing one to use simpler
static models, such as feed-forward neural networks. Typical architectures
belonging to this class are feed-forward networks looking at the input
sequences through a window of fixed size and Time-Delay networks, which
exploit this window approach also for hidden activations. The networks in this
class have been related to FIR filters, since they can be considered as
nonlinear versions of these filters. Moreover, from a computational point of view,
this class of networks is strictly related to Definite-Memory Sequential Machines.
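
The buffer-based view can be sketched in a few lines of Python. The function
names windowed_inputs and nonlinear_fir, the zero padding at the start of the
sequence, and the use of a single tanh unit are assumptions made for
illustration only.

```python
import math


def windowed_inputs(sequence, window, pad=0.0):
    """Turn a sequence into fixed-size buffers, newest element last.

    Buffer t holds the inputs from time t - window + 1 to time t; positions
    before the start of the sequence are filled with `pad`.
    """
    buffers = []
    for t in range(len(sequence)):
        start = t - window + 1
        padding = [pad] * max(0, -start)
        buffers.append(padding + list(sequence[max(0, start): t + 1]))
    return buffers


def nonlinear_fir(buffers, weights):
    """A static map applied to each buffer: a weighted sum (as in an FIR
    filter) followed by a squashing function."""
    return [math.tanh(sum(w * x for w, x in zip(weights, buf)))
            for buf in buffers]
```

For instance, windowed_inputs([1, 2, 3, 4], 3) yields the buffers
[0.0, 0.0, 1], [0.0, 1, 2], [1, 2, 3] and [2, 3, 4]; any static model can then
be applied to each buffer independently, and no input older than the window can
influence the output.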

Implicit time representation assumes causality, i.e., the output at time t only
depends on the present and past inputs. If causality holds, then the memory
about the past can be stored in an internal state. From a practical point of
view, internal representations can be obtained by recurrent connections. To this
class of networks belong Fully Connected networks, NP networks, NARX
networks, Recurrent Cascade Correlation networks, and so on. According to the
type of topology involving the recurrent connections, different types of memory
can be implemented (e.g., input (transformed) memory, hidden (transformed)
memory, output (transformed) memory). Moreover, in discrete-time networks,
different kinds of temporal dependencies can be expressed by resorting to
different Discrete-Time Operators, such as the standard shift operator, the delta
operator, the gamma operator, the rho operator, and so on. Because of the
internal state, this class of networks is strictly related to IIR filters, and to
several classes of sequential machines (such as Finite State Sequential Machines,
Finite-Memory Sequential Machines, etc.). A good overview of all these different
architectural aspects can be found in [67].

2.3 Training Algorithms 

There is a huge variety of training algorithms for recurrent neural networks.
Almost all of them are based on gradient descent. Among these, the most
popular algorithms are Back-propagation
Through Time (BPTT) [44], Real Time Recurrent Learning (RTRL) [75],
developed for on-line training, the Extended Kalman Filter (EKF) [74] [49], and
Temporal Difference [62] [63]. As for feed-forward networks [10], second-order or
quasi-second-order methods can be defined for recurrent neural networks (see
[66] for an overview).
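
To make the gradient-based idea concrete, here is a minimal sketch of
Back-propagation Through Time for a hypothetical one-unit recurrent network
x_t = tanh(w x_{t-1} + v u_t), with a squared error measured on the final state
only. The network, the loss, and the function names forward, loss and bptt are
assumptions chosen to keep the example small; they are not taken from [44].

```python
import math


def forward(w, v, inputs, x0=0.0):
    """Unfold the network in time; returns [x_0, x_1, ..., x_T]."""
    xs = [x0]
    for u in inputs:
        xs.append(math.tanh(w * xs[-1] + v * u))
    return xs


def loss(w, v, inputs, target):
    """Squared error on the final state only."""
    return 0.5 * (forward(w, v, inputs)[-1] - target) ** 2


def bptt(w, v, inputs, target):
    """Gradient of the loss with respect to (w, v), propagated backwards."""
    xs = forward(w, v, inputs)
    dw = dv = 0.0
    delta = xs[-1] - target                 # dE/dx_T
    for t in range(len(inputs), 0, -1):
        da = delta * (1.0 - xs[t] ** 2)     # through tanh at step t
        dw += da * xs[t - 1]                # w is shared by all time steps
        dv += da * inputs[t - 1]
        delta = da * w                      # dE/dx_{t-1}
    return dw, dv
```

The backward loop multiplies delta by w (1 - x_t^2) at every step, so for
|w| < 1 the contribution of early inputs to the gradient shrinks geometrically
with the length of the sequence.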

Moreover, there is a class of constructive algorithms which exploit the gra- 
dient to build up the network architecture during training, according to the 
training data complexity. Within this class we can mention Recurrent Cascade- 
Correlation [18], a partition algorithm using Radial Basis Functions [68], and 
Recurrent Neural Trees [60] . 

There is also a class of stochastic learning algorithms for recurrent networks. 
For example, EM [17] (or GEM) can be used to train feed-forward [8] and recur- 
rent networks [42]. Also Evolutionary Algorithms (Genetic Algorithms) can be 
used to train recurrent networks [52] [9]. Notice that Evolutionary Algorithms 
can be considered constructive algorithms, since with a suitable representation 
of the recurrent network, more and more complex networks can evolve within a 
population of networks. 

2.4 Training Heuristics 

The successful training of a neural network usually cannot be obtained by just
running the selected training algorithm on an arbitrary configuration of the
network architecture and learning parameters. Several additional issues
must be considered. This is particularly true for recurrent networks.

For example, the size of the state variable is very relevant. Another
important issue is which kind of delay lines should be used in the network, since
a correct choice may help in capturing the right temporal dependencies hidden
in the training data. It is therefore usually useful to have multi-step delay lines
within the recurrent network.

In addition, in order to facilitate training, it may be important to choose 
the right representation for the starting state or even to have the possibility 
to learn it [21]. Also connectivity can be very critical: fully connected networks 
allow for the discovery of high-order correlations, while a sparse connectivity 
can significantly speed up training and return very good solutions where high- 
order correlations are not relevant. Similarly, training times can be reduced by 
using a learning rule which exploits truncated gradients, or by using an adaptive 
learning rate. 

Finally, in order to have some guarantee that the trained network will show 
some generalization capability, regularization [76] [66] and/or pruning (during 
and post training) [31] [48] [66] should be used. 

These are just some of the issues which must be taken into consideration when
training a recurrent neural network. Thus training a recurrent network is not 
just a problem of choosing a suitable architecture and learning algorithm: several 
different heuristics should be applied to fill in the missing information about the 
learning task.

2.5 Computational Power and Learning Facts

Training heuristics are useful; however, it is important to discover the
computational and complexity limitations and strengths of network architectures
and learning algorithms, since this may help us avoid wasting time and resources
on computational devices which are not suited to the learning task at hand. Several
results concerning computational power and learning complexity for recurrent
neural networks have been obtained.

Concerning computational power, among positive results we can mention 
that recurrent networks can model any first order discrete-time, time invariant, 
non-linear system (see for example [54] [59]). In addition, it has been observed 
that recurrent neural networks with just 2 neurons can exhibit chaos [64] [15]. 
This last result is interesting since it testifies that even a trivial network can 
show a very complex behavior, thus implicitly demonstrating that, in principle, 
very complex computations could be performed by this network. Finally, some 
recurrent architectures, such as fully recurrent and NARX networks, are Turing 
equivalent [56] [57] [55]. 

On the other hand, some recurrent architectures have limited computational
power. For example, single layer recurrent networks cannot represent all Finite
State Automata [32] (while Elman networks can because of the presence of the
output layer [39]). Similarly, some constructive methods for RNNs (e.g., Recurrent
Cascade-Correlation) generate networks which are computationally limited [27]
[38].

Concerning learning, it has been proved that gradient descent based algorithms
for RNNs converge for bounded sequences and under constraints on the learning
rate [40]. Unfortunately, however, the loading problem [35] for RNNs (i.e., finding
a set of weights consistent with the training data) is unsolvable [72]. Moreover,
even if several recurrent architectures have the computational capability to
represent arbitrary nonlinear dynamical systems, gradient-based training suffers
from the problem of long-term dependencies [11] [41], showing difficulties in
learning even very simple dynamical behaviors. The problem of long-term
dependencies can be understood as the inability of gradient descent algorithms
(when used on several of the most common RNN architectures) to store error
information concerning past inputs which are far in time from the present input.
Some heuristics have been proposed to try to reduce the problem of vanishing
gradient information [29] [53] [41] [34]; however, none of them is able to
completely remove it.
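
The vanishing of gradient information can be illustrated numerically. The
following is a minimal sketch, assuming a one-unit recurrent network with state
update h_t = tanh(w h_{t-1} + x_t); the function name, the weight value, and the
all-zero input sequence are illustrative choices and are not taken from the cited
works. The derivative of the final state with respect to the initial state is a
product of per-step factors w (1 - h_t^2), which shrinks geometrically whenever
|w| < 1.

```python
import math

def state_sensitivity(w, inputs, h0=0.0):
    """Return d h_T / d h_0 for the toy model h_t = tanh(w * h_{t-1} + x_t)."""
    h = h0
    grad = 1.0
    for x in inputs:
        h = math.tanh(w * h + x)
        # Chain rule: d h_t / d h_{t-1} = w * (1 - h_t^2)
        grad *= w * (1.0 - h * h)
    return grad

# Same weight, short versus long sequence (hypothetical values).
short_seq = state_sensitivity(0.5, [0.0] * 5)
long_seq = state_sensitivity(0.5, [0.0] * 50)
print(short_seq, long_seq)
```

With these values the sensitivity over 50 steps is many orders of magnitude
smaller than over 5 steps, so an error signal tied to a distant input barely
influences the weight update.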

2.6 Knowledge Insertion and Refinement 

From Section 2.5 it is clear that RNNs which are computationally very powerful
are also very difficult to train. Ideally, we would like to have a recurrent
architecture which is computationally complete, i.e., Turing equivalent, and also
easy to train (see Figure 1).

Some authors have proposed to exploit a priori information on the application
domain to master learning complexity. From the point of view of generalization,
this idea is supported by theoretical results on the decomposition of the error of
a neural network into Bias and Variance [26]. These results suggest that in order
for a neural network to properly learn the desired function, some significant prior
structure should be given to the network.

[Figure 1: plot titled "Learning is difficult!"; axis label: Complexity of Learning.]

Fig. 1. Learning is difficult: as soon as the computational power of the RNN increases,
the complexity of training the network increases exponentially. It would be great to
have a powerful RNN which is easy to train.

There are different ways to implement the above idea. One possibility is to
give some hints to the network on specific properties of the desired function (see
for example [1] [6] [58] [3] [2] [5] [4]) or to insert prior knowledge in the form of
rules into the neural network and then to train it using a standard learning algorithm
(for the case of dynamically-driven recurrent neural networks see [16] [22] [7]
[23] [46]). A variant of this approach refines rough knowledge (see for example
[43] [30] [29]) by extracting the knowledge coded into the neural network through
algorithms which take a neural network as input and return a set of rules or, for
sequence domains, an FSM (see for example [47] [28] [65] [15]). These rules are
then inserted back into the neural network and the insertion/training/extraction
cycle is repeated several times.

It is usually believed that the above approach helps learning since it may
reduce the number of functions that are candidates for the desired function.
Moreover, due to the inserted knowledge, training times should be reduced.
Unfortunately, while the above statements may be true for specific and (often)
small domains, in general there are at least two reasons why the above approach
may fail. First of all, the insertion of rules or FSMs into neural networks implies
that a good amount of the network structure is predetermined, as well as several
of the values of the weights. This creates some difficulties for the learning
process (especially if the learning algorithm is based on gradient descent), which
has to satisfy the additional constraints imposed by the knowledge insertion;
very often the neural units are saturated and thus difficult to train
by a gradient descent approach. Moreover, rule (or DFA) extraction from
neural networks is not as easy as was expected. For example, some criticisms
about the reliability of DFA extraction from recurrent neural networks have been
raised in [37]. Although a partial solution to these criticisms has been given in
[73], the problem of reliable extraction of knowledge from neural networks is still
a research subject. For example, an approach based on reinforcement
learning for extracting complete action plans from sequences has recently been
proposed [61].

2.7 Application Requirements 

Application fields are many and diverse. This diversity implies the need for
different approaches, ranging from symbolic to sub-symbolic techniques. Given
a specific problem, several questions need to be answered to obtain
a successful, flexible, and portable solution. When using RNNs, some typical
questions are:

What is the appropriate RNN architecture(s) for the problem to be solved?

What is the appropriate RNN training algorithm(s)?

How can a priori knowledge be best used?

What should be done if no existing architecture/algorithm is suited for the problem
to be solved (development of a new architecture/algorithm)?

How can an RNN be integrated with other approaches?

The possibility of giving correct answers to these questions in a short time is
related to the availability of specification languages for prototyping and
experimentation. Unfortunately, it must be stressed that even though many neural
network simulators have been developed (e.g., Aspirin/Migraines, Rochester
Connectionist Simulator, NNSYSID, Stuttgart Neural Network Simulator, Toolkit for
Mathematica, just to mention a few), as well as specification languages for neural
networks (e.g., EpsiloNN, Neural Simulation Language), they implement a
restricted set of specific models, dealing mostly with static data. To our
knowledge, there is no single specification language which can support the user in
answering all the above questions, especially when considering the
development of new architectures and/or algorithms. The situation is even worse
when considering the integration of RNNs with symbolic approaches.



2.8 Unifying Theories 

The lack of a “universal” neural specification language is mainly due to a lack of
synthesis of the main concepts and results in the neural network field. Up to now,
several different architectures and training algorithms have been devised. Many
of the new improvements, however, are just small and insignificant changes to
existing architectures and/or algorithms. Furthermore, since the research is not
based on a general framework, it is difficult to focus on the study of important
underlying properties. For this reason some researchers have felt the need to
try to develop, especially for dynamical systems, a general framework able to
describe the foundations of both architectures and learning algorithms.

Nerrand et al. [45] describe a general framework that encompasses algorithms 
for feed-forward and recurrent neural networks, and algorithms for parameter 
estimation of non-linear filters. Specifically, feed-forward networks are viewed as
transversal filters, while recurrent networks are viewed as recursive filters. Their approach is
based on the definition of a canonical form which can be used as a building block 
for the training algorithms based on gradient estimation. A similar approach is 
followed also by Santini et al. [51]. 
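
The filter analogy can be made concrete with a minimal sketch. The code below
only illustrates the two filter types named above and is not the canonical form
of [45]; the function names and coefficient values are assumptions. A transversal
filter computes each output from a finite window of past inputs, while a recursive
filter also feeds back its own previous output, which plays the role of the
internal state of a recurrent network.

```python
def transversal_filter(xs, coeffs):
    """y_t = sum_k coeffs[k] * x_{t-k}: the output depends on inputs only."""
    ys = []
    for t in range(len(xs)):
        y = 0.0
        for k, c in enumerate(coeffs):
            if t - k >= 0:
                y += c * xs[t - k]
        ys.append(y)
    return ys

def recursive_filter(xs, a, b):
    """y_t = a * y_{t-1} + b * x_t: the previous output is fed back."""
    ys = []
    y_prev = 0.0
    for x in xs:
        y_prev = a * y_prev + b * x
        ys.append(y_prev)
    return ys

# Response to a unit impulse (hypothetical coefficients).
impulse = [1.0, 0.0, 0.0, 0.0, 0.0]
fir = transversal_filter(impulse, [0.5, 0.25])
iir = recursive_filter(impulse, a=0.5, b=1.0)
print(fir)
print(iir)
```

The transversal response dies out after as many steps as there are coefficients,
whereas the recursive response keeps decaying without ever being cut off.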

Tsoi and Back [69] and Tsoi [67] propose a unifying view of discrete time feed- 
forward and recurrent network architectures, basing their work on systems theory 
with linear dynamics. In this case canonical forms are used to group similar 
architectures and a unifying description is obtained by exploiting a notation 
based on matrices. 

Wan and Beaufays [71] [70] suggest an approach, exploiting flow graph theory,
to construct and manipulate block diagrams representing neural networks.
Gradient algorithms for temporal neural networks are derived on the basis of a set
of simple block diagram manipulation rules.

A computational approach is followed by Berthold et al. [12] [19] [36] [20]. 
In their works Graph Grammars (see [50]) are used to formally specify neural 
networks and their corresponding training algorithms. One of the benefits of 
using this formal framework is the support for proving properties of the training 
algorithms. Moreover, the proposed methodology can be used to design new 
network architectures along with the required training algorithms. 

A unification of deterministic and probabilistic learning in
structured domains, thus including as special cases feed-forward and recurrent
neural networks, has been proposed by Frasconi et al. [24], where graphical
models are used to describe in a unified framework both neural and probabilistic
(Bayesian Networks) transductions involving data structures. The basic idea is
to represent functional dependencies within an adaptive device by graphical
models. The graphical models are then “unfolded” over the data structure to make
explicit all the functional dependencies in the data. This process generates an
encoding network which can be implemented either by a neural or a Bayesian
network.



3 Need for a Neural Abstract Machine 

As argued in Section 2.8, current neural network simulators and specification
languages have too many drawbacks to be a valid tool for the development
of successful neural solutions to application problems. First of all, they
are too specific, since they typically implement a restricted set of neural
models. Moreover, it is usually not possible, or very difficult, to slightly modify
the specification of a given standard model, to combine different models, to insert
a priori knowledge, or to develop a new model. Only some of them can deal with
sequences (or structures). Finally, as pointed out in [25], in several application
domains it is not possible to assume both causality and stationarity, and to our
knowledge none of them is able to cope with these requirements.

What we really need is to further develop the unifying approaches described
in the previous section so as to reach the full specification of a Neural Abstract
Machine, i.e., a “universal” theory of neural computation, where all the basic
and relevant neural concepts, and only them, are formally defined and used. For
example, we must be able to specify the input data types
(static vectors, sequences, structures) and how to represent them (e.g., one-hot
encoding, distributed representation, etc.), i.e., the objects to be manipulated
by the neural network. Then we must be able to specify the operation
types and their representation, e.g., functional dependencies, deterministic or
probabilistic functions, gradient propagation, growing operators, compositional
rules, pruning, knowledge insertion, shift operators, weight sharing, and so on.
Finally, we have to master basic computational concepts and their
implementation, e.g., model of computation, unrolling in time, unfolding on the
structure, (non-)causality, (non-)stationarity, and so on.

When all these entities are defined and ways of implementing them are
specified, to face an application problem we just have to write a few lines of
“neural code” based on the Neural Abstract Machine. What we suggest is to perform
a computational synthesis of the relevant issues in neural computing, recognizing
the basic atoms of neural computation and how these basic atoms can be
combined in order both to reproduce known neural models and to develop new
architectures and learning algorithms, without the need to recode everything in
a standard programming language such as Java or C++. By analogy
with the history of computers, we have to move from combinatorial or sequential
circuits (current neural networks) to a Von Neumann machine (the Neural
Abstract Machine).



4 Abstract State Machines 



In order to devise a formal description of the Neural Abstract Machine, we need
to decide which type of specification method to use. It is difficult to make such a
decision among the many formal methods, since a lot of the available theories
are not practical for the description of complex dynamic real-world systems [14].
We think that Evolving Algebras, devised by Gurevich [33], present some useful
features for the formal development of the Neural Abstract Machine. Furthermore,
this formalism has been intensively used for the formal design and analysis of
various hardware systems, algorithms, and programming language semantics,
showing an astonishing simplicity while preserving mathematical soundness and
completeness. The main reason for its simplicity is the imperative specification
style which, in contrast to conventional algebraic specification methods, is
easier to understand. This leads to an easy definition of formal and informal
requirements, turning them into a satisfactory ground model [14], i.e., a formal
model that satisfies all the system requirements. Moreover, as stated in Börger’s
work [13], evolving algebras have many other features that can help when
devising a system or defining a language semantics:

— The freedom of abstraction, that allows incremental development of systems 
by stepwise refinement through a vertical hierarchy of intermediate models; 

— The information hiding and the precise definition of interfaces, which help
the horizontal structuring of modules. In practice, when using a function
we do not care about its implementation; we are interested only in the
interface;

— The scalability to complex real-world systems; 

— The easy learning and usability of the model. 



4.1 Domains and Dynamic Functions 

When specifying a system or a language semantics some entities must be defined,
in particular the basic object classes and the set of elementary operations on
objects, i.e., the basic domains and functions. Each domain represents a category
of elements that contributes to the definition of the whole system. These domains
are completely abstract, since they represent some sort of information that will
be specified later. For example, a generic neural network is built up by a linked set
of computational units. In order to formalize a neural network with an Evolving
Algebra (EA), we may define two basic domains: the NEURONS set and the
CHANNELS set. Note that this is only one of all possible formalizations of the
basic elements of a neural network. Even if usually within the EA framework
the domains are static, it is also possible to have dynamic domains [33]. This is
very useful for our project, since there are many neural architectures that grow
or shrink (through pruning) during training. Furthermore, the time unfolding
of recurrent networks and the graph unfolding of recursive networks are done at
run-time, thus we cannot assume to have static networks.

Once the system domains have been defined, the set of all properties and 
elementary operations must be specified by corresponding functions. A function 
is defined in a mathematical sense as: 

A function is a set of (n + 1)-tuples, where the (n + 1)-th element is
functionally dependent on the first n elements (its arguments) [14].

Typically, each system operation updates a value (e.g., a memory location in a
hardware system) given a set of other values (e.g., registers). In the ASM
framework this corresponds to the dynamic function update or destructive assignment
defined as:

$f(t_1, \ldots, t_n) \leftarrow t$

where $f$ is an arbitrary n-ary function as defined above, $t_1, \ldots, t_n$ are the
function parameters defined in some domains, and $t$ is the value to which the
function is set. EAs are so general that each term used can be of any complexity
or abstraction. Continuing with the previous neural example, we can define the
function that, given a channel, returns its strength parameter as:

$\mathit{weight} : \mathrm{CHANNELS} \to \mathbb{R}$.

Since during training the weights of the network are changed, the update of one
channel weight can be formalized as:

$\mathit{weight}(c_i) \leftarrow \mathit{weight}(c_i) + \delta_w(c_i)$,

where $c_i$ is an object belonging to the domain CHANNELS, and $\delta_w$ is a function
that returns the amount of weight update associated with the specified channel (a
real number).
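
A minimal sketch of these notions in Python, offered only as an illustration: the
two domains are modeled as sets of names and the dynamic function weight as a
dictionary, so that a function update becomes a destructive assignment to one
entry. The object names, the initial weight, and the constant returned by delta_w
are hypothetical.

```python
# Domains modeled as plain sets of (hypothetical) object names.
NEURONS = {"n1", "n2"}
CHANNELS = {"c21"}

# Dynamic function weight : CHANNELS -> R, stored as a lookup table.
weight = {"c21": 0.5}

def delta_w(channel):
    """Amount of weight update for a channel (a constant, for illustration)."""
    return -0.25

def update_weight(channel):
    """Dynamic function update: weight(c) <- weight(c) + delta_w(c)."""
    if channel not in CHANNELS:
        raise KeyError(channel)
    weight[channel] = weight[channel] + delta_w(channel)

update_weight("c21")
print(weight["c21"])
```

Storing the function as a table keeps the specification abstract: how delta_w is
actually computed can be refined later without changing the update itself.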

4.2 Abstract States and Transition Instructions 

Thanks to the freedom of abstraction and information hiding properties of EAs, it
is possible to produce rigorous high level specifications without worrying about
the future design. This is accomplished using the concepts of abstract state and
abstract transition function. It is for this reason that Evolving Algebras are
nowadays also termed Abstract State Machines (ASMs). The ASM concept of
abstract state should not be confused with the notion of state used in finite
state automata:

The ASM abstract state is a collection of domains and dynamic functions 
defined on the domains. 

The abstract states are subject to integrity constraints that partially describe
the machine behavior. More precisely, when a system is in a state, the set of all
reachable states is limited by the machine specification. Even if the notation
used for the formulation of the constraints is not limited by any programming
language, a set of basic actions has been designed in order to describe the ASM
behavior. These actions constitute the set of abstract transition functions, also
termed machine abstract instructions. The dynamic function updates are the
basic operations by which the behavior of a system can be described. However,
they are inadequate for a complete system description, so the model needs to
be enriched with control operators, frequently termed rules. The most general
of these operators is the guarded assignment:

if Cond then Updates 

where Cond is a condition (the constraint), and Updates consists of finitely many 
function updates, which are executed simultaneously. At this point we have all 
the ingredients for an ASM system behavior description [13,14]: 

Definition of Abstract State Machines. An ASM $\mathcal{M}$ is a finite set
of guarded function updates, used for evolving the machine step by step.
When $\mathcal{M}$ is in a state $S$, all guard conditions are evaluated (with
standard logic), and all instructions whose guard condition is satisfied are
selected. Then, all the function updates of all selected instructions are
simultaneously executed, transforming the state $S$ into a new state $S'$.
This procedure is iteratively applied as long as possible
(i.e., until no rule condition is satisfied). The ASM run can be
defined as the set of all transitions that bring the machine $\mathcal{M}$ from the
initial state $S$ to the final state $S'$, where no rule has a satisfied
guard.

The above definition shows how simple the ASM concept is, since it is the only
notion one has to know about the ASM semantics in order to be able to use this
formalism.
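
The execution scheme in the definition can be sketched in a few lines of Python.
This is an illustrative toy interpreter, not an implementation from [13,14]: the
state is assumed to be a dictionary, a rule is a pair (guard, updates), and
conflicting updates to the same location are not checked for. The example rules
are hypothetical.

```python
def asm_step(state, rules):
    """Fire all enabled rules simultaneously; return (new_state, fired)."""
    updates = {}
    fired = False
    for guard, make_updates in rules:
        if guard(state):
            fired = True
            # Updates are computed from the OLD state and applied later.
            updates.update(make_updates(state))
    new_state = dict(state)
    new_state.update(updates)
    return new_state, fired

def asm_run(state, rules, max_steps=100):
    """Iterate steps until no guard holds (or a safety bound is reached)."""
    for _ in range(max_steps):
        state, fired = asm_step(state, rules)
        if not fired:
            break
    return state

# Hypothetical example: swap x and y once, and count down to zero.
rules = [
    (lambda s: not s["swapped"],
     lambda s: {"x": s["y"], "y": s["x"], "swapped": True}),
    (lambda s: s["count"] > 0,
     lambda s: {"count": s["count"] - 1}),
]
final = asm_run({"x": 1, "y": 2, "swapped": False, "count": 3}, rules)
print(final)
```

Because every update is computed from the old state before any of them is
applied, the swap of x and y needs no temporary variable; this is the practical
meaning of simultaneous execution.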

4.3 ASM Rules and Graphical Representation 

As previously stated, all the ingredients needed for a system behavior description
with ASMs have been introduced. Nevertheless, the model is so general that a
free use of programming notation is permitted. Although a general rule for ASMs is
that, unless there is an important reason, it is better to avoid the use of complex
non-standard concepts or notations, some simple and generic rules have been
introduced in the ASM formalism in order to simplify the designer's work:

skip; 

^ Pifli ■ ■ • ) im)'j 

— forall f in Dom such that Cond(f)
    Rule;

— choose f in Dom such that Cond(f)
    Rule;

where f is a function signature, Dom is a domain to which the given signature
must belong, and Cond is an arbitrary condition on the function. The first rule,
obviously, does nothing. The second rule represents a collection of rules. The
third and the fourth rules allow a selection of rules to be executed. Note that
with the last rule a sort of explicit nondeterminism has been introduced. An
alternative nondeterministic solution could be to modify the semantic
definition of ASMs by allowing the firing at each step of only one of all fireable updates.
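
The difference between the forall and choose rules can be sketched as follows.
This is a hedged illustration in Python, assuming that a rule is a function
returning a dictionary of updates; the helper names, the weight table, and the
pruning rule are hypothetical.

```python
import random

def forall(domain, cond, rule):
    """Collect the updates of rule(f) for EVERY f in domain satisfying cond."""
    updates = {}
    for f in domain:
        if cond(f):
            updates.update(rule(f))
    return updates

def choose(domain, cond, rule, rng=random):
    """Collect the updates of rule(f) for ONE arbitrarily chosen f satisfying cond."""
    candidates = [f for f in domain if cond(f)]
    if not candidates:
        return {}  # behaves like skip when nothing can be chosen
    return rule(rng.choice(candidates))

# Hypothetical channel weights; the rule sets a small weight to zero.
weight = {"c1": 0.01, "c2": 0.8, "c3": 0.02}

def is_small(c):
    return abs(weight[c]) < 0.1

def prune(c):
    return {c: 0.0}

all_updates = forall(sorted(weight), is_small, prune)
one_update = choose(sorted(weight), is_small, prune, rng=random.Random(0))
print(all_updates, one_update)
```

Here forall yields an update for both small weights, while choose yields exactly
one of them; which one depends on the random generator, which is where the
explicit nondeterminism enters.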

In order to further simplify the design phase of a system, a graphical
representation of the previous rules has also been introduced. The guarded update
is an instruction that allows the ASM transition from a state $S$ to a state $S'$,
thus the guarded update rule can be rewritten as:

if (currentstate = S) and (Cond) then
    Updates
    currentstate ← S'

which can be graphically represented by the following diagram:

Fig. 2. A graphical representation of the required communication messages between
channels and neurons. The signals u and y are the neuron input and output, δ is the
gradient information computed by the neuron, and δ' = δw is the gradient information
computed by the channel.



Note that the rectangular box may also contain a set of rules. This is very
helpful when formalizing a problem with a top-down approach, because a complex
concept can be expressed by a box, and left for future expansion.



5 A Sketch of Feed-Forward Network Specification 

Our approach to the Neural Abstract Machine (NAM) is based on the idea of a
kernel able to process feed-forward neural networks. In this section we provide a
simplified high level formal description of such a kernel, adopting a combined
bottom-up and top-down approach, and showing some operational details with ASMs.
Note that this is only a (non-rigorous) exercise intended to give an idea of
how ASM technology can be applied. In particular we show the main features
on which we are working for the NAM project.



5.1 Neurons and Channels Domains Specification 

As stated in Section 4.1, in order to design an ASM, the set of basic domains must
be defined. In our framework, since a net is viewed as a collection of neurons and
connections (channels), we define the NEURONS and the CHANNELS domains,
which represent the basic block categories composing a neural network.

In view of the learning process, we can assume that channels, besides neurons,
are “active” entities¹, i.e., able to autonomously update the associated
weight on the basis of the gradient information. In Figure 2 all the required
communications between a channel and a neuron unit are shown.

Our basic assumption is that the kernel of the system is based on a Neural
Control Machine (NCM), which dynamically generates the neural network
connecting the basic blocks (channels and neurons), controls the flow of
computation, and furnishes the input data and eventually the error information.
The NCM should also be able to handle constructive algorithms, such as Cascade
Correlation. In order to have this ability, the NCM should be able to communicate
with each channel or neuron during the operative phase, sending signals such as
freeze (or de-freeze). Moreover, the NCM needs to be able to dynamically create
or destroy units during training, thus changing the neural architecture. The
advantage of using the ASM formal specification method is that such operations
can be easily formalized with suitable function assignments.

¹ The assumption that a channel is an active entity is not a requirement. According
to the actual implementation, it may also be a passive entity accessed and modified
by a Neural Control Machine.

5.2 Architecture Manipulation Functions 

As previously stated, we have two basic block categories, represented by the 
NEURONS and CHANNELS domains. 

In order to control the basic behavior of neurons and channels we need to
define some functions over the domains NEURONS and CHANNELS. For
instance, we need functions that, given a computational object belonging to one
of the two domains, return the “names” of all the other objects connected to
it. Thus, for each domain, we need two functions, source and dest, which return
the sources and the destinations of a given unit (N) or channel (C) during the
forward phase:



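$\mathit{source}_C : \mathrm{CHANNELS} \to \mathrm{NEURONS}$
$\mathit{dest}_C : \mathrm{CHANNELS} \to \mathrm{NEURONS}$
$\mathit{source}_N : \mathrm{NEURONS} \to \mathcal{P}(\mathrm{CHANNELS})$
$\mathit{dest}_N : \mathrm{NEURONS} \to \mathcal{P}(\mathrm{CHANNELS})$

where $\mathcal{P}$ denotes the power set. In our formalization, while a neuron may have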
many source and destination channels, a channel has only one source and one 
destination neuron. Nevertheless, higher order connections, which have multiple 
sources and/or multiple destinations, can be easily modeled by the following new 
definition: 



$\mathit{source}_C : \mathrm{CHANNELS} \to \mathcal{P}(\mathrm{NEURONS})$.

Since the Neural Control Machine needs to access all resources of the imple- 
mented neural network (neurons and channels), two functions are required: 

$\mathit{all\_neurons} : \mathcal{P}(\mathrm{NEURONS})$
$\mathit{all\_channels} : \mathcal{P}(\mathrm{CHANNELS})$

which return the set of all neurons and the set of all channels used by the network. 
Furthermore, in order to know which are the input and the output units of the 
neural network the NCM requires the following two functions: 



$\mathit{in\_layer} : \mathcal{P}(\mathrm{NEURONS})$
$\mathit{out\_layer} : \mathcal{P}(\mathrm{NEURONS})$

We assume that the NCM uses a data flow computational paradigm, i.e.,
the computation starts when the NCM transmits data to the $\mathit{in\_layer}$ units,
and finishes when all units in the $\mathit{out\_layer}$ have computed their output and no
computation for other units needs to be performed. Moreover, it is the responsibility
of the NCM to implement the backward propagation of the gradient information
across the network. In this case, too, a data flow computation is performed.

The previously defined functions allow the generation of a neural network.
In fact, after the creation of a set of computational units $n_i \in \mathrm{NEURONS}$ and
$c_{ij} \in \mathrm{CHANNELS}$, the network can be built up by a set of simple assignments
like the following:

$\mathit{source}_C(c_{ij}) \leftarrow n_j$
$\mathit{dest}_C(c_{ij}) \leftarrow n_i$
$\mathit{source}_N(n_i) \leftarrow \{c_{i1}, \ldots, c_{ik}\}$


5.3 Units Computation Functions 

Each computational unit behavior may be described by three basic functions for
each direction of the flow of computation (either forward or backward): the input
function, the state transition function, and the output function. In the following
we show the functions characterizing the forward computation of the neural
units.

The input function of the neural units during the forward phase could be
described by the following interface:

$\mathit{in\_forw}_N : \mathrm{NEURONS} \to \mathrm{INPUT}$,

where INPUT is a new domain introduced to generalize the data description.
With this approach the specifications can be left for future refinements; however,
we can imagine that each channel transmits to a neuron a pair of data, formed
by the signal conveyed by the channel and the weight associated with the channel.
In this way, each neuron can use any internal computation function. For this
reason we can assume $\mathrm{INPUT} = \mathcal{P}((\mathbb{R}, \mathbb{R}))$. As a result of this choice, accessing
the input of a neuron with the function $\mathit{in\_forw}_N$ returns a list of pairs.

Obviously, data are returned only if they have previously been sent to the neuron
by all channels. For this reason, when data are ready on the output interface of all
channels, they are copied into the input function with the following assignment:

$\mathit{in\_forw}_N(n_i) \leftarrow \{\mathit{out\_forw}_C(c_{i1}), \ldots, \mathit{out\_forw}_C(c_{ik})\}$

where $c_{i1}, \ldots, c_{ik}$ can be determined beforehand using the function $\mathit{source}_N(n_i)$,
and $\mathit{out\_forw}_C$ is the channel output function. This way of processing emphasizes
the data-flow paradigm implicitly assumed by our system. In fact, the internal
computation can be done only when all data are ready on the input side of the
neuron, i.e., when the neural unit possesses all the required inputs.

At this point the neurons can carry out the internal computations on the
input data. In order to do this the following functions are required:

$\mathit{compute\_forw}_N : \mathrm{INPUT} \to \mathrm{STATE}$
$\mathit{output\_function}_N : \mathrm{STATE} \to \mathrm{OUTPUT}$

The function $\mathit{compute\_forw}_N$ computes the internal state of the neuron (e.g., the
net value), and the function $\mathit{output\_function}_N$ computes the output value starting
from the internal state. As previously stated for the INPUT domain, the
STATE and OUTPUT domains can also be left unspecified for future refinements;
however, for the sake of presentation, we assume $\mathrm{STATE} = \mathrm{OUTPUT} = \mathbb{R}$.
Note that the state is not strictly necessary in feed-forward networks; however,
it allows the storing of information needed by the backward phase.

Two other functions are needed for accessing the STATE and OUTPUT 
values of the neuron: 



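$\textit{state-forw}_N : \mathrm{NEURONS} \rightarrow \mathrm{STATE}$
$\textit{out-forw}_N : \mathrm{NEURONS} \rightarrow \mathrm{OUTPUT}$

They can be instantiated as follows:

$\textit{state-forw}_N(n_i) \leftarrow \textit{compute-forw}_N(\textit{in-forw}_N(n_i))$
$\textit{out-forw}_N(n_i) \leftarrow \textit{output-function}_N(\textit{state-forw}_N(n_i))$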

Note that the function $\textit{compute-forw}_N$ does not take into account the previous
state, so there is no memory of the past. If memory is required, as in
neural networks with short term memory, the function interface and the state
transition may be changed in the following way:

$\textit{compute-forw}_N : \mathrm{STATE} \times \mathrm{INPUT} \rightarrow \mathrm{STATE}$
$\textit{state-forw}_N(n_i) \leftarrow \textit{compute-forw}_N(\textit{state-forw}_N(n_i), \textit{in-forw}_N(n_i))$
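The two interfaces above can be illustrated with a short Python sketch. The text leaves the state transition and output functions unspecified, so the weighted sum, the tanh output and the `decay` parameter used here are assumptions chosen only to make the example concrete.

```python
import math
from typing import List, Tuple

# INPUT modeled as a list of (signal, weight) pairs, one per incoming channel.
Input = List[Tuple[float, float]]


def compute_forw(inp: Input) -> float:
    """INPUT -> STATE without memory: the net value (assumed weighted sum)."""
    return sum(signal * weight for signal, weight in inp)


def output_function(state: float) -> float:
    """STATE -> OUTPUT: tanh is an illustrative choice, not prescribed."""
    return math.tanh(state)


def compute_forw_with_memory(prev_state: float, inp: Input,
                             decay: float = 0.5) -> float:
    """STATE x INPUT -> STATE: the previous state is an explicit argument.

    The leaky blend controlled by `decay` is hypothetical.
    """
    return decay * prev_state + compute_forw(inp)


inp = [(1.0, 0.5), (2.0, -0.25)]
state = compute_forw(inp)                      # net value of the neuron
out = output_function(state)                   # output derived from the state
state_mem = compute_forw_with_memory(1.0, inp)  # same input, previous state 1.0
```

In the second variant the past is passed in from outside, which is the only change needed to give the neuron memory.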

Even if the proposed functions give just a partial definition of the feed-forward
computation of a neural unit, they show how a neural model can be
devised with ASMs. The backward propagation of the neural unit is not much
different from the forward propagation, and the channel functions (either
forward or backward) are also similar to the neuron functions. It is theoretically
interesting to note that, with such a block approach, neurons and channels are
very similar. In an object-oriented programming language the NEURONS and
CHANNELS domains could be two classes inheriting from the same class.
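The remark about a shared parent class might look as follows in Python. The class names `Block`, `Channel` and `Neuron`, and the choice of an identity output function, are hypothetical; the text only states that the two domains could inherit from the same class.

```python
from abc import ABC, abstractmethod


class Block(ABC):
    """Hypothetical common parent: input -> state -> output."""

    def __init__(self):
        self.state = None
        self.output = None

    @abstractmethod
    def compute_forw(self, data):
        """State transition, specialized by each subclass."""

    def output_function(self, state):
        return state  # identity by default (an assumption)

    def forward(self, data):
        self.state = self.compute_forw(data)
        self.output = self.output_function(self.state)
        return self.output


class Channel(Block):
    def __init__(self, weight: float):
        super().__init__()
        self.weight = weight

    def compute_forw(self, signal):
        # A channel delivers the pair (signal, weight) to its target neuron.
        return (signal, self.weight)


class Neuron(Block):
    def compute_forw(self, pairs):
        # Assumed net value: weighted sum of the incoming pairs.
        return sum(signal * weight for signal, weight in pairs)


pair = Channel(0.5).forward(2.0)
net = Neuron().forward([pair, (1.0, 1.0)])
```

Both subclasses reuse the same `forward` step, which mirrors the observation that neurons and channels differ only in their internal functions.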

5.4 The Neural Control Machine 

Here, we are going to describe a possible formalization of the feed-forward part
of the Neural Control Machine. In Figure 3 we show the part of the NCM that,
starting from a clear blackboard, creates a network and forward propagates a
given input data till an output result is returned.

Fig. 3. This flowchart shows the part of NCM involved in the forward propagation of
data in a feed-forward neural network. The NCM generates a neural network with a
specified topology, then it furnishes an input to the network and propagates it through
the network.

The first step of the NCM is the creation of the neural network. This can
be accomplished by creating all the neurons and all the required channels. Then,
all the objects are linked. When the network is ready, the forward propagation
of one input datum can be accomplished. In order to do this, the datum is sent to
the input neurons of the network (known through the function in-layer). After
that, the propagation of data starts. The machine iteratively propagates data
through all neurons and channels. When the network output is ready, the NCM
can collect it from the output units (known through the function out-layer).
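The control loop just described can be sketched in Python as follows. This is a simplified illustration, not the NCM itself: the function name `forward_propagate`, the triple encoding of channels, and the tanh unit are all assumptions.

```python
import math
from typing import Dict, List, Tuple


def forward_propagate(channels: List[Tuple[str, str, float]],
                      in_layer: List[str],
                      out_layer: List[str],
                      inputs: Dict[str, float]) -> Dict[str, float]:
    """Propagate one input through a feed-forward net given as weighted channels.

    Each channel is (source neuron, target neuron, weight).
    """
    missing = [n for n in in_layer if n not in inputs]
    if missing:
        raise ValueError(f"no data for input neurons: {missing}")
    out = dict(inputs)  # input neurons simply expose the data they receive
    incoming: Dict[str, List[Tuple[str, float]]] = {}
    for src, dst, weight in channels:
        incoming.setdefault(dst, []).append((src, weight))
    pending = set(incoming)
    while pending:
        # A neuron may fire only when every incoming channel has data.
        ready = [n for n in pending
                 if all(src in out for src, _ in incoming[n])]
        if not ready:
            raise ValueError("network has a cycle or an unfed neuron")
        for n in ready:
            net = sum(out[src] * weight for src, weight in incoming[n])
            out[n] = math.tanh(net)  # illustrative output function
            pending.remove(n)
    return {n: out[n] for n in out_layer}


result = forward_propagate([("x", "h", 1.0), ("h", "y", 2.0)],
                           in_layer=["x"], out_layer=["y"],
                           inputs={"x": 0.0})
```

The loop never needs an explicit layer ordering: readiness of the inputs alone decides which neurons compute next.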

In order to show how the freedom of abstraction property of the ASMs can 
help when incrementally developing by stepwise refinement, we have refined the 
neuron computation procedure left undefined in Figure 3. The specification of 
this computation is given in Figure 4. 

When the data for a neuron is ready in input, it is copied into an internal
working area for fast and easy access during the further computation. At this
point the internal state is computed using the function $\textit{compute-forw}_N$ and the
result is internally stored by assigning it to the function $\textit{state-forw}_N$. Finally, the
output result can be computed using the function $\textit{output-function}_N$ over the
internal state, and the result is stored by means of the function $\textit{out-forw}_N$.

Fig. 4. This flowchart is an example of refinement of the NEURON COMPUTATION
left unexplained in the flowchart of Figure 3. When data is ready in input, it is
assigned to the internal memory working area, then the internal state is computed,
and finally the output result of the neuron is computed.



6 Some Details of the Neural Abstract Machine for 
Sequences 

We have previously explained the basic behavior of the NAM kernel able to pro- 
cess feed-forward networks. In this section we provide a simple high level formal 
description of the part of Neural Abstract Machine devoted to the treatment of 
recurrent networks for structured data. 

As previously stated (see Section 3), in order to define the NAM we need
a general unifying theory of all the different types of neural network architectures
and training algorithms. Even if some work on unification has already appeared,
we use here only a limited set of such ideas, since most of them are addressed
to the internal organization of networks and their learning algorithms.

In particular, the work by Tsoi and Back [69] and Tsoi [67] showed that
several recurrent networks can be reduced to a canonical recurrent form.
Furthermore, if all cycles in the network are controlled by a delay operator, the
forward and learning operations of a recurrent network can be easily reduced
to the forward and learning operations of a feed-forward encoding network with
identical behavior and weights (see footnote 2). For example, when considering a
recurrent network where internal loops are controlled by a delay operator of one
time step, the system transitions can be represented by the following Mealy model:

$$X_t = f_t(X_{t-1}, U_t) \quad \text{and} \quad Y_t = g_t(X_t, U_t), \tag{1}$$

where $X_t$ is the internal state at time $t$, $U_t$ is the input at time $t$, $Y_t$ is the
computed output at time $t$, and $f_t$ and $g_t$ are the transition function and the
output function at time $t$. Note that, even if the state transition is recursive, the
functions $f_t$ and $g_t$ are intrinsically non-recursive; in fact the past information
(i.e. the previous system computation) does not belong to the function $f_t$ but is
given to it as external information. This shows that the recursive system can
be easily modeled by a suitable composition of non-recursive functions. Note also
that the given specification assumes the possibility of a non-stationary system,
where the functions $f_t$ and $g_t$ change over time.

Footnote 2: The weights of the recurrent network are shared among all layers of the
transformed net.
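A Mealy model of this kind can be run as a plain loop over non-recursive functions, as the following Python sketch shows. It assumes the state update has the form X_t = f_t(X_{t-1}, U_t); the helper name `run_mealy` and the toy functions `f` and `g` are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def run_mealy(f: Callable[[int, float, float], float],
              g: Callable[[int, float, float], float],
              x0: float,
              inputs: Sequence[float]) -> Tuple[float, List[float]]:
    """Apply X_t = f(t, X_{t-1}, U_t) and Y_t = g(t, X_t, U_t) over a sequence.

    Passing t explicitly allows non-stationary systems, where the
    functions change over time.
    """
    x = x0
    outputs = []
    for t, u in enumerate(inputs, start=1):
        x = f(t, x, u)              # the past state is handed in from outside
        outputs.append(g(t, x, u))
    return x, outputs


# Toy, stationary example: the state accumulates the inputs.
f = lambda t, x, u: x + u
g = lambda t, x, u: x * u
final_state, ys = run_mealy(f, g, 0, [1, 2, 3])
```

Neither `f` nor `g` calls itself; the recursion lives entirely in the loop that feeds each state back in.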

It is clear at this point that, given a data set and a (recurrent) network
specification, the Neural Abstract Machine should be able to devise the parametric
non-recursive functions $f_t$ and $g_t$ by which to define the feed-forward neural
network that behaves as the recurrent network. In other words, the NAM should
be able to use the formal description (see footnote 3) of a (recurrent) neural
network in conjunction with the data set in order to find the optimal solution
for the network parameters.

For a better understanding of the expected behavior of the NAM, let us
reconsider Equation 1, which explains how the recurrent network can be
dynamically unfolded over a sequence, generating a feed-forward network (also called
encoding network). Informally, the encoding network is derived by dynamically
unfolding the temporal operations of the recurrent network over each element of
the observed sequence. Specifically, the recurrent network without the delayed
links is replicated for each element of the sequence, and then each delayed link
from neural unit i to unit j is mapped into a corresponding link from unit i to
unit j of adjacent layers in the new feed-forward network (see Figure 5). In this
way, for each input sequence a different feed-forward network is generated.
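The unfolding described above can be illustrated for a toy recurrent network with one state unit. The sketch below is an assumption-laden example: the weight names `w_in` and `w_rec`, the tanh unit, and the helper names are hypothetical. It checks that the unfolded network with shared weights computes the same value as the recurrent loop.

```python
import math
from typing import Dict, List, Sequence, Tuple

Channel = Tuple[str, str, str]  # (source unit, target unit, shared weight name)


def unfold(length: int) -> List[Channel]:
    """Build the encoding network for a sequence of the given length."""
    channels: List[Channel] = []
    for t in range(1, length + 1):
        channels.append((f"u{t}", f"x{t}", "w_in"))       # replicated input link
        channels.append((f"x{t - 1}", f"x{t}", "w_rec"))  # delayed link between layers
    return channels


def run_recurrent(weights: Dict[str, float], x0: float,
                  inputs: Sequence[float]) -> float:
    x = x0
    for u in inputs:
        x = math.tanh(weights["w_in"] * u + weights["w_rec"] * x)
    return x


def run_encoding(channels: List[Channel], weights: Dict[str, float],
                 x0: float, inputs: Sequence[float]) -> float:
    values = {"x0": x0}
    values.update({f"u{t}": u for t, u in enumerate(inputs, start=1)})
    for t in range(1, len(inputs) + 1):
        net = sum(values[src] * weights[w]
                  for src, dst, w in channels if dst == f"x{t}")
        values[f"x{t}"] = math.tanh(net)
    return values[f"x{len(inputs)}"]


weights = {"w_in": 0.5, "w_rec": -0.3}
sequence = [1.0, -2.0, 0.5]
encoding = unfold(len(sequence))
same = math.isclose(run_recurrent(weights, 0.0, sequence),
                    run_encoding(encoding, weights, 0.0, sequence))
```

Every layer refers to the same two weight names, so a longer sequence yields a deeper network without adding parameters.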

Since we can assume that all recurrent neural networks can be reduced to
feed-forward networks, we now try to formalize a machine able to apply this
transformation.



6.1 The Unfolding Machine 

Let us study the high-level description of the unfolding process when the input
data consists of sequences. The process we are interested in can
be easily expressed by the (ASM-based) flowchart shown in Figure 6, where the
first part of the processing of a sequence is shown.

Notice that the machine needs to access a database of sequences located
somewhere. We are not interested in details such as how it is made and where it
is stored. As its first operation the machine “loads” a sequence from the database
into the machine working area. This is then followed by the unfolding operation.
The cycle that controls the unfolding is based on the idea that a sequence can
be represented as a directed graph composed of nodes connected as shown in the
following:

[Diagram of the sequence graph; axis label: Time]

Fig. 5. Given an input sequence, a recurrent network (a) can be unfolded in time,
originating an equivalent feed-forward network called encoding network (b).

Footnote 3: In order to give a formal description of a neural network, a Neural
Language should also be defined.


Specifically, note that we assume the inverse orientation of the arcs with
respect to the time of elaboration. The reason is that, using this approach, we
can further extend the machine to the processing of structures, such as trees,
forests and, more generally, DOAGs (Directed Ordered Acyclic Graphs). In fact,
a recurrent network can be applied to structured data by unfolding it over the
structure in a way which is very similar to the one used for sequences (see
footnote 4).

The flowchart contains no operational details about the algorithm; there
is only a (very) high level description of the unfolding operation. The first step
only asserts that a sequence must be loaded for future processing. The second
step is based on the assumption that all the elements (termed nodes from now
on) of the loaded sequence can be marked as “visited” or “not visited”. The
second step is a loop over all not visited nodes, stating that at each iteration, for
each element of the frontier, the encoding network must be extended by unfolding
the recurrent network.

Footnote 4: Note that sequences are a particular instance of DOAG.

Fig. 6. Flowchart for the processing of sequences by a recurrent network. The
feed-forward encoding network, for each sequence, is generated by unfolding the recurrent
network on the input sequence. Notice that the machine is defined to work also for
structured data, such as trees.

During the first iteration, the frontier is constituted by all leaves of the data
structure (i.e. the first element of the sequence). From the second iteration on, the
frontier is determined by all those nodes not yet visited for which all sons have
been visited. The loop stops when the roots are reached (i.e. the last element
of the sequence). When there are no more undefined elements (i.e. all elements
of the data have been used for the unfolding operation) the Unfolding Machine
hands control to the Pruning Machine, which, in the case of complex data
structures, removes portions of the encoding network for which no gradient
information needs to be computed. At the end of this process the encoding network
is ready to be used.

The mathematical description of the frontier computation is based on some
simple functions. The function visited(n) returns a boolean value indicating
whether the node n has been visited during the unfolding loop, the function
leaf(n) returns a boolean value that signals whether a node is a leaf of the
structure, and finally, son(n) returns the set of all nodes m that have the node n as
father.

The most important and complex part of the Unfolding Machine is the
function unfold(n) that, for each node n of the structured data (e.g. an element of
a sequence), must dynamically create the feed-forward encoding network.
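The frontier loop can be sketched in Python using only the notions of visited nodes and sons. The function name `unfolding_order` and the dictionary encoding of the structure are assumptions; the call to unfold(n) is indicated by a comment because the text does not specify it.

```python
from typing import Dict, List, Set


def unfolding_order(sons: Dict[str, List[str]]) -> List[List[str]]:
    """Return the successive frontiers of a DOAG given as node -> list of sons.

    A node enters the frontier when it is not yet visited and all of its
    sons are visited; leaves (no sons) therefore form the first frontier.
    """
    visited: Set[str] = set()
    frontiers: List[List[str]] = []
    while len(visited) < len(sons):
        frontier = sorted(n for n in sons
                          if n not in visited
                          and all(m in visited for m in sons[n]))
        if not frontier:
            raise ValueError("the structure is not acyclic")
        # unfold(n) would extend the encoding network for each n here.
        frontiers.append(frontier)
        visited.update(frontier)
    return frontiers


# A sequence: arcs point against time, so the first element is the leaf.
sequence = {"e1": [], "e2": ["e1"], "e3": ["e2"]}
# A small tree: the root r has two leaf sons.
tree = {"a": [], "b": [], "r": ["a", "b"]}
seq_frontiers = unfolding_order(sequence)
tree_frontiers = unfolding_order(tree)
```

For a sequence every frontier holds a single node, while for a tree all nodes at the same height above the leaves are handled together.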

7 Conclusions 

Sequence processing is a very complex task. Depending on the learning problem,
different approaches can be used. Neural networks for sequences are especially
useful when data is numerical and noisy, or when uncertainty characterizes the
learning task.

Training a recurrent neural network, however, is not a trivial task. Several
choices about the network architecture, as well as the training algorithm, must
be made. Moreover, heuristics must be used to set the learning rates, the
regularization tools, the pruning procedure, and so on.

When present, a priori knowledge should be used to reduce the burden of 
training a network. In some cases, this may speed up learning and improve 
the generalization ability of the network. However, in general, learning remains 
difficult, also considering the long-term dependencies. 

To successfully apply recurrent neural networks in real world domains, all
the above choices must be made in the shortest time and in the most reliable
way. At present, no software tool is available to support all the aspects described
above, including the rapid development of ad hoc new neural network models.

Recalling some recent attempts to find a unifying theory for (recurrent)
neural networks, we suggested pushing this challenge further, since this would create
the theoretical basis on which to build a Neural Abstract Machine. This machine
could be used to formally define a sort of neural programming language for fast
prototyping and development of neural solutions to sequence (and structured)
learning problems. As a computational formalism we suggested using Abstract
State Machines (ASMs), which allow the Neural Abstract Machine to be described at
different levels of abstraction and detail, while preserving simplicity of
presentation and comprehension. A very brief and preliminary example of how to use
ASMs for neural networks was discussed.

Finally, we stress that the development of a Neural Abstract Machine would:
1) help in understanding what is really needed and worth using in (recurrent)
neural networks; 2) create a universal computational language for neural
computation; 3) improve the possibility of integrating, at a computational level,
neural networks with other approaches such as expert systems, Bayesian
networks, fuzzy systems, and so on.




On the Need for a Neural Abstract Machine 






References 

1. Abu-Mostafa, Y. S., 1990. Learning from Hints in Neural Networks. Journal of 
Complexity 6:192-198. 

2. Abu-Mostafa, Y. S., 1993a. Hints and the VC Dimension. Neural Computation 5, 
no. 2:278-288. 

3. Abu-Mostafa, Y. S., 1993b. A Method for Learning From Hints. In Advances in 
Neural Information Processing Systems, eds. S. J. Hanson, J. D. Cowan, and C. L. 
Giles, vol. 5, pp. 73-80. Morgan Kaufmann, San Mateo, CA. 

4. Abu-Mostafa, Y. S., 1995a. Financial Applications of Learning from Hints. In Ad- 
vances in Neural Information Processing Systems, eds. G. Tesauro, D. Touretzky, 
and T. Leen, vol. 7, pp. 411-418. The MIT Press. 

5. Abu-Mostafa, Y. S., 1995b. Hints. Neural Computation 7, no. 4:639-671. 

6. Al-Mashouq, K. A. and Reed, I. S., 1991. Including Hints in Training Neural Nets. 
Neural Computation 3, no. 3:418-427. 

7. Alquezar, R. and Sanfeliu, A., 1995. An Algebraic Framework to Represent Finite 
State Machines in Single-Layer Recurrent Neural Networks. Neural Computation 
7, no. 5:931-949. 

8. Amari, S., 1995. Information Geometry of the EM and em Algorithms for Neural 
Networks. Neural Networks 8, no. 9:1379-1408. 

9. Angeline, P. J., Saunders, G. M., and Pollack, J. B., 1994. An Evolutionary
Algorithm That Constructs Recurrent Neural Networks. IEEE Transactions on Neural
Networks 5, no. 1:54-65.

10. Battiti, T., 1992. First- and Second-Order Methods for Learning: Between Steepest 
Descent and Newton’s Method. Neural Computation 4, no. 2:141-166. 

11. Bengio, Y., Simard, P., and Frasconi, P., 1994. Learning Long-Term Dependencies 
with Gradient Descent is Difficult. IEEE Transactions on Neural Networks 5, 
no. 2:157-166. 

12. Berthold, M. and Fischer, I., 1997. Formalizing Neural Networks Using Graph
Transformations. In Proceedings of the IEEE International Conference on Neural
Networks, vol. 1, pp. 275-280. IEEE.

13. Börger, E., 1995. Why Use Evolving Algebras for Hardware and Software
Engineering? In SOFSEM’95, 22nd Seminar on Current Trends in Theory and Practice of
Informatics, eds. M. Bartošek, J. Staudek, and J. Wiedermann, vol. 1012 of Lecture
Notes in Computer Science, pp. 236-271. Berlin Heidelberg New York:
Springer-Verlag.

14. Börger, E., 1999. High Level System Design and Analysis using Abstract State
Machines. In Current Trends in Applied Formal Methods (FM-Trends 98), eds.
D. Hutter, W. Stephan, P. Traverso, and M. Ullmann, vol. 1641 of Lecture Notes
in Computer Science, pp. 1-43. Berlin Heidelberg New York: Springer-Verlag.

15. Casey, M., 1996. The Dynamics of Discrete-Time Computation, with
Application to Recurrent Neural Networks and Finite State Machine Extraction. Neural
Computation 8, no. 6:1135-1178.

16. Das, S., Giles, C. L., and Sun, G. Z., 1992. Learning Context-free Grammars:
Limitations of a Recurrent Neural Network with an External Stack Memory. In
Proceedings of The Fourteenth Annual Conference of the Cognitive Science Society,
pp. 791-795. San Mateo, CA: Morgan Kaufmann Publishers.

17. Dempster, A. P., Laird, N. M., and Rubin, D. B., 1977. Maximum likelihood from 
incomplete data via the EM algorithm (with discussion). Journal of the Royal 
Statistical Society series B 39:1-38. 




D. Sona and A. Sperduti



18. Fahlman, S., 1991. The Recurrent Cascade-Correlation Architecture. In Advances 
in Neural Information Processing Systems 3, eds. R. Lippmann, J. Moody, and 
D. Touretzky, pp. 190-196. San Mateo, CA: Morgan Kaufmann Publishers. 

19. Fischer, I., Koch, M., and Berthold, M. R., 1998a. Proving Properties of Neural 
Networks with Graph Transformations. In Proceedings of the IEEE International 
Joint Conference on Neural Networks, pp. 457-456. Anchorage, Alaska. 

20. Fischer, I., Koch, M., and Berthold, M. R., 1998b. Showing the Equivalence of
Two Training Algorithms - Part 2. In Proceedings of the IEEE International Joint
Conference on Neural Networks, pp. 441-446. Anchorage, Alaska.

21. Forcada, M. L. and Carrasco, R. C., 1995. Learning the Initial State of a Second- 
Order Recurrent Neural Network during Regular-Language Inference. Neural Com- 
putation 7, no. 5:923-930. 

22. Frasconi, P., Gori, M., Maggini, M., and Soda, G., 1991. A Unified Approach for
Integrating Explicit Knowledge and Learning by Example in Recurrent Networks.
In International Joint Conference on Neural Networks, pp. 811-816.

23. Frasconi, P., Gori, M., and Soda, G., 1995. Recurrent Neural Networks and Prior 
Knowledge for Sequence Processing: A Constrained Nondeterministic Approach. 
Knowledge Based Systems 8, no. 6:313-332. 

24. Frasconi, P., Gori, M., and Sperduti, A., 1998. A General Framework for Adap- 
tive Processing of Data Structures. IEEE Transactions on Neural Networks 9, 
no. 5:768-786. 

25. Frasconi, P., Gori, M., and Sperduti, A., 2000. Integration of Graphical-Based
Rules with Adaptive Learning of Structured Information. In Hybrid Neural Symbolic
Integration, eds. S. Wermter and R. Sun. Springer-Verlag. To appear.

26. Geman, S., Bienenstock, E., and Doursat, R., 1992. Neural Networks and the 
Bias/Variance Dilemma. Neural Computation 4, no. 1:1-58. 

27. Giles, C. L., Chen, D., Sun, G.-Z., Chen, H.-H., Lee, Y.-C., and Goudreau, M. W.,
1995. Constructive Learning of Recurrent Neural Networks: Limitations of
Recurrent Cascade Correlation and a Simple Solution. IEEE Transactions on Neural
Networks 6, no. 4:829-836.

28. Giles, C. L., Miller, C. B., Chen, D., Chen, H. H., Sun, G. Z., and Lee, Y. C., 
1992. Learning and Extracted Finite State Automata with Second-Order Recurrent 
Neural Networks. Neural Computation 4, no. 3:393-405. 

29. Giles, C. L. and Omlin, C. W., 1993a. Extraction, Insertion and Refinement of
Symbolic Rules in Dynamically-Driven Recurrent Neural Networks. Connection
Science 5, no. 3:307-337.

30. Giles, C. L. and Omlin, C. W., 1993b. Rule Refinement with Recurrent Neural
Networks. In 1993 IEEE International Conference on Neural Networks (ICNN’93),
vol. II, p. 810. Piscataway, NJ: IEEE Press.

31. Giles, C. L. and Omlin, C. W., 1994. Pruning Recurrent Neural Networks for
Improved Generalization Performance. IEEE Transactions on Neural Networks 5,
no. 5:848-851.

32. Goudreau, M. W., Giles, C. L., Chakradhar, S. T., and Chen, D., 1993. On Re- 
current Neural Networks and Representing Finite State Recognizers. In Third In- 
ternational Conference on Artificial Neural Networks, pp. 51-55. The Institution 
of Electrical Engineers, London, UK. 

33. Gurevich, Y., 1995. Evolving Algebras 1993: Lipari Guide. In Specification and
Validation Methods, ed. E. Börger, pp. 9-36. Oxford University Press.

34. Hochreiter, S. and Schmidhuber, J., 1997. Long Short Term Memory. Neural
Computation 9, no. 8:1735-1780.







35. Judd, J. S., 1989. Neural Network Design and the Complexity of Learning. MIT 
press. 

36. Koch, M., Fischer, I., and Berthold, M. R., 1998. Showing the Equivalence of
Two Training Algorithms - Part 1. In Proceedings of the IEEE International Joint
Conference on Neural Networks, pp. 441-446. Anchorage, Alaska.

37. Kolen, J. F., 1994. Fool’s Gold: Extracting Finite State Machines from Recurrent 
Network Dynamics. In Advances in Neural Information Processing Systems, eds. 
J. D. Cowan, G. Tesauro, and J. Alspector, vol. 6, pp. 501-508. Morgan Kaufmann 
Publishers, Inc. 

38. Kremer, S., 1996. Finite State Automata that Recurrent Cascade-Correlation
Cannot Represent. In Advances in Neural Information Processing Systems 8, eds.
D. Touretzky, M. Mozer, and M. Hasselmo, pp. 612-618. MIT Press.

39. Kremer, S. C., 1995. On the Computational Power of Elman-Style Recurrent 
Networks. IEEE Transactions on Neural Networks 6, no. 4:1000-1004. 

40. Kuan, C.-M., Hornik, K., and White, H., 1994. A Convergence Result for Learning 
in Recurrent Neural Networks. Neural Computation 6, no. 3:420-440. 

41. Lin, T., Horne, B. G., Tino, P., and Giles, C. L., 1996. Learning Long-Term 
Dependencies in NARX Recurrent Neural Networks. IEEE Transactions on Neural 
Networks 7, no. 6:1329-1338. 

42. Ma, S. and Ji, C., 1998. Fast Training of Recurrent Networks Based on the EM
Algorithm. IEEE Transactions on Neural Networks 9, no. 1:11-26.

43. Maclin, R. and Shavlik, J. W., 1992. Refining Algorithms with Knowledge-Based 
Neural Networks: Improving the Chou-Fasman Algorithm for Protein Folding. In 
Computational Learning Theory and Natural Learning Systems, eds. S. Hanson, 
G. Drastal, and R. Rivest. MIT Press. 

44. McClelland, J. L. and Rumelhart, D. E., 1987. Parallel Distributed
Processing: Explorations in the Microstructure of Cognition. Volume 1:
Foundations. Volume 2: Psychological and Biological Models. MIT Press. The PDP
Research Group, MIT.

45. Nerrand, O., Roussel-Ragot, P., Personnaz, L., Dreyfus, G., and Marcos, S., 1993.
Neural Networks and Nonlinear Adaptive Filtering: Unifying Concepts and New
Algorithms. Neural Computation 5, no. 2:165-199.

46. Omlin, C. and Giles, C., 1996. Constructing Deterministic Finite-State Automata 
in Recurrent Neural Networks. Journal of the ACM 43, no. 6:937-972. 

47. Omlin, C. W., Giles, C. L., and Miller, C. B., 1992. Heuristics for the Extraction of
Rules from Discrete-Time Recurrent Neural Networks. In Proceedings International
Joint Conference on Neural Networks 1992, vol. I, pp. 33-38.

48. Pedersen, M. W. and Hansen, L. K., 1995. Recurrent Networks: Second Order 
Properties and Pruning. In Advances in Neural Information Processing Systems, 
eds. G. Tesauro, D. Touretzky, and T. Leen, vol. 7, pp. 673-680. The MIT Press. 

49. Puskorius, G. V. and Feldkamp, L. A., 1994. Neurocontrol of Nonlinear Dynamical
Systems with Kalman Filter Trained Recurrent Networks. IEEE Transactions on
Neural Networks 5, no. 2:279-297.

50. Rozenberg, G., Courcelle, B., Ehrig, H., Engels, G., Janssens, D., Kreowski, H.,
and Montanari, U., eds., 1997. Handbook of Graph Grammars: Foundations, vol. 1.
World Scientific.

51. Santini, S., Bimbo, A. D., and Jain, R., 1995. Block structured recurrent neural
networks. Neural Networks 8:135-147.

52. Saunders, G. M., Angeline, P. J., and Pollack, J. B., 1994. Structural and
Behavioral Evolution of Recurrent Networks. In Advances in Neural Information
Processing Systems, eds. J. D. Cowan, G. Tesauro, and J. Alspector, vol. 6, pp.
88-95. Morgan Kaufmann Publishers, Inc.

53. Schmidhuber, J., 1992. Learning Complex, Extended Sequences Using the Principle 
of History Compression. Neural Computation 4, no. 2:234-242. 

54. Seidl, D. and Lorenz, D., 1991. A structure by which a recurrent neural network 
can approximate a nonlinear dynamic system. In Proceedings of the International 
Joint Conference on Neural Networks, vol. 2, pp. 709-714. 

55. Siegelmann, H., Horne, B., and Giles, C., 1997. Computational capabilities of
recurrent NARX neural networks. IEEE Trans. on Systems, Man and Cybernetics.
In press.

56. Siegelmann, H. T. and Sontag, E. D., 1991. Turing Computability with Neural 
Nets. Applied Mathematics Letters 4, no. 6:77-80. 

57. Siegelmann, H. T. and Sontag, E. D., 1995. On the Computational Power of Neural 
Nets. Journal of Computer and System Sciences 50, no. 1:132-150. 

58. Simard, P., Victorri, B., Le Cun, Y., and Denker, J., 1992. Tangent Prop - A
Formalism for Specifying Selected Invariances in an Adaptive Network. In Advances
in Neural Information Processing Systems, eds. J. E. Moody, S. J. Hanson, and
R. P. Lippmann, vol. 4, pp. 895-903. Morgan Kaufmann Publishers, Inc.

59. Sontag, E., 1993. Neural Networks for control. In Essays on Control: Perspectives
in the Theory and its Applications, eds. H. L. Trentelman and J. C. Willems, pp.
339-380. Boston, MA: Birkhäuser.

60. Sperduti, A. and Starita, A., 1997. Supervised Neural Networks for the Classifica- 
tion of Structures. IEEE Transactions on Neural Networks 8, no. 3:714-735. 

61. Sun, R. and Sessions, C., 1998. Extracting plans from reinforcement learners.
Proceedings of the 1998 International Symposium on Intelligent Data Engineering
and Learning, eds. L. Xu, L. Chan, I. King, and A. Fu, pp. 243-248. Springer-Verlag.

62. Sutton, R. S., 1988. Learning to Predict by the Methods of Temporal Differences. 
Machine Learning 3:9-44. 

63. Tesauro, G., 1992. Practical Issues in Temporal Difference Learning. Machine 
Learning 8:257-277. 

64. Tino, P., Horne, B., and Giles, C. L., 1995. Fixed Points in Two-Neuron Discrete
Time Recurrent Networks: Stability and Bifurcation Considerations. Tech. Rep.
UMIACS-TR-95-51 and CS-TR-3461, Institute for Advanced Computer Studies,
University of Maryland, College Park, MD 20742.

65. Towell, G. G. and Shavlik, J. W., 1993. Extracting Refined Rules from
Knowledge-Based Neural Networks. Machine Learning 13:71-101.

66. Tsoi, A., 1998a. Gradient Based Learning Methods. In Adaptive Processing of
Sequences and Data Structures: Lecture Notes in Artificial Intelligence, eds. C. Giles
and M. Gori, pp. 27-62. New York, NY: Springer Verlag.

67. Tsoi, A., 1998b. Recurrent Neural Network Architectures: An Overview. In
Adaptive Processing of Sequences and Data Structures: Lecture Notes in Artificial
Intelligence, eds. C. Giles and M. Gori, pp. 1-26. New York, NY: Springer Verlag.

68. Tsoi, A. and Tan, S., 1997. Recurrent Neural Networks: A constructive algorithm
and its properties. Neurocomputing 15, no. 3-4:309-326.

69. Tsoi, A. C. and Back, A., 1997. Discrete Time Recurrent Neural Network Archi- 
tectures: A Unifying Review. Neurocomputing 15:183-223. 

70. Wan, E. A. and Beaufays, F., 1998. Diagrammatic Methods for Deriving and
Relating Temporal Neural Network Algorithms. In Adaptive Processing of Sequences
and Data Structures: Lecture Notes in Artificial Intelligence, eds. C. Giles and
M. Gori, pp. 63-98. New York, NY: Springer Verlag.







71. Wan, E. A. and Beaufays, F., 1996. Diagrammatic Derivation of Gradient Algo- 
rithms for Neural Networks. Neural Computation 8, no. 1:182-201. 

72. Wiklicky, H., 1994. On the Non-Existence of a Universal Learning Algorithm 
for Recurrent Neural Networks. In Advances in Neural Information Processing 
Systems, eds. J. D. Cowan, G. Tesauro, and J. Alspector, vol. 6, pp. 431-436. 
Morgan Kaufmann Publishers, Inc. 

73. Wiles, J. and Bollard, S., 1996. Beyond finite state machines: steps towards repre- 
senting and extracting context-free languages from recurrent neural networks. In 
NIPS’96 Rule Extraction from Trained Artificial Neural Networks Workshop, eds. 
R. Andrews and J. Diederich. 

74. Williams, R. J., 1992. Some Observations on the Use of the Extended Kalman 
Filter as a Recurrent Network Learning Algorithm. Tech. Rep. NU-CCS-92-1, 
Computer Science, Northeastern University, Boston, MA. 

75. Williams, R. J. and Zipser, D., 1988. A Learning Algorithm for Continually Run- 
ning Fully Recurrent Neural Networks. Tech. Rep. ICS Report 8805, Institute for 
Cognitive Science, University of California at San Diego, La Jolla, CA. 

76. Wu, L. and Moody, J., 1996. A Smoothing Regularizer for Feedforward and Re- 
current Neural Networks. Neural Computation 8, no. 3:461-489. 




Sequence Mining in Categorical Domains: 
Algorithms and Applications 



Mohammed J. Zaki 

Computer Science Department 
Rensselaer Polytechnic Institute, Troy, NY 



1 Introduction 

This chapter focuses on sequence data in which each example is represented as a
sequence of “events”, where each event might be described by a set of predicates,
i.e., we are dealing with categorical sequential domains. Examples of sequence
data include text, DNA sequences, web usage data, multi-player games, plan
execution traces, and so on.

The sequence mining task is to discover a set of attributes, shared across time 
among a large number of objects in a given database. For example, consider the 
sales database of a bookstore, where the objects represent customers and the 
attributes represent authors or books. Let’s say that the database records the 
books bought by each customer over a period of time. The discovered patterns 
are the sequences of books most frequently bought by the customers. An example 
could be that, “70% of the people who buy Jane Austen’s Pride and Prejudice 
also buy Emma within a month.” Stores can use these patterns for promotions, 
shelf placement, etc. Consider another example of a web access database at a 
popular site, where an object is a web user and an attribute is a web page. 
The discovered patterns are the sequences of most frequently accessed pages at 
that site. This kind of information can be used to restructure the web-site, or 
to dynamically insert relevant links in web pages based on user access patterns. 
Other domains where sequence mining has been applied include identifying plan 
failures [15], selecting good features of classification [7], finding network alarm 
patterns [4], and so on. 

The task of discovering all frequent sequences in large databases is quite
challenging. The search space is extremely large. For example, with m attributes
there are $O(m^k)$ potentially frequent sequences of length k. With millions of
objects in the database the problem of I/O minimization becomes paramount.
However, most current algorithms are iterative in nature, requiring as many full
database scans as the length of the longest frequent sequence; clearly a very
expensive process.

In this chapter we present SPADE (Sequential PAttern Discovery using 
Equivalence classes), a new algorithm for discovering the set of all frequent 
sequences. The key features of our approach are as follows: 1) We use a vertical 
id-list database format, where we associate with each sequence a list of objects in 
which it occurs, along with the time-stamps. We show that all frequent sequences 
can be enumerated via simple temporal joins (or intersections) on id-lists. 2) We 
use a lattice-theoretic approach to decompose the original search space (lattice) 
into smaller pieces (sub-lattices) which can be processed independently in 
main-memory. Our approach requires a few (usually three) database scans, or only a 
single scan with some pre-processed information, thus minimizing the I/O costs. 
3) We decouple the problem decomposition from the pattern search. We propose 
two different search strategies for enumerating the frequent sequences within 
each sub-lattice: breadth-first and depth-first search. 

SPADE not only minimizes I/O costs by reducing database scans, but also 
minimizes computational costs by using efficient search schemes. The vertical 
id-list based approach is also insensitive to data-skew. An extensive set of 
experiments shows that SPADE outperforms previous approaches by a factor of two, 
and by an order of magnitude if we have some additional off-line information. 
Furthermore, SPADE scales linearly in the database size, and a number of other 
database parameters. 

We also discuss how sequence mining can be applied in practice. We show that 
in complicated real-world applications, like predicting plan failures, sequence 
mining can produce an overwhelming number of frequent patterns. We discuss 
how one can identify the most interesting patterns using pruning strategies in 
a post-processing step. Our experiments show that our approach improves the 
plan success rate from 82% to 98%, while less sophisticated methods for choosing 
which part of the plan to repair were only able to achieve a maximum of 85% 
success rate. We also show that the mined patterns can be used to build 
execution monitors which predict failures in a plan before they occur. We were 
able to produce monitors with 100% precision that signal 90% of all the failures 
that occur. 

As another application, we describe how to use sequence mining for feature 
selection. The input is a set of labeled training sequences, and the output is 
a function which maps from a new sequence to a label. In other words we are 
interested in selecting (or constructing) features for sequence classification. In 
order to generate this function, our algorithm first uses sequence mining on a 
portion of the training data for discovering frequent and distinctive sequences 
and then uses these sequences as features to feed into a classification algorithm 
(Winnow or Naive Bayes) to generate a classifier from the remainder of the data. 
Experiments show that the new features improve classification accuracy by more 
than 20% on our test datasets. 

The rest of the chapter is organized as follows: In Section 2 we describe the 
sequence discovery problem, and in Section 3 we look at related work. In Section 4 
we develop our lattice-based approach for problem decomposition and for 
pattern search. Section 5 describes our new algorithm. An experimental study is 
presented in Section 6. Section 7 discusses how sequence mining can be used 
in a real planning domain, while Section 8 describes its use in feature selection. 
Finally, we conclude in Section 9. 

2 Problem Statement 

The problem of mining sequential patterns can be stated as follows: Let 
$I = \{i_1, i_2, \ldots, i_m\}$ be a set of m distinct items comprising the alphabet. An event is 
a non-empty unordered collection of items (without loss of generality, we assume 
that items of an event are sorted in lexicographic order). A sequence is an ordered 
list of events. An event is denoted as $(i_1 i_2 \cdots i_k)$, where $i_j$ is an item. A sequence 
$\alpha$ is denoted as $(\alpha_1 \to \alpha_2 \to \cdots \to \alpha_q)$, where $\alpha_i$ is an event. A sequence with 
k items ($k = \sum_j |\alpha_j|$) is called a k-sequence. For example, $(B \to AC)$ is a 
3-sequence. 

For a sequence $\alpha$, if the event $\alpha_i$ occurs before $\alpha_j$, we denote it as $\alpha_i < \alpha_j$. 
We say $\alpha$ is a subsequence of another sequence $\beta$, denoted as $\alpha \preceq \beta$, if there 
exists a one-to-one order-preserving function f that maps events in $\alpha$ to events 
in $\beta$, that is, 1) $\alpha_i \subseteq f(\alpha_i)$, and 2) if $\alpha_i < \alpha_j$ then $f(\alpha_i) < f(\alpha_j)$. For example, 
the sequence $(B \to AC)$ is a subsequence of $(AB \to E \to ACD)$, since $B \subseteq AB$ 
and $AC \subseteq ACD$, and the order of events is preserved. On the other hand the 
sequence $(AB \to E)$ is not a subsequence of $(ABE)$, and vice versa. 
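The subsequence test can be sketched in a few lines of Python. This is only an illustration of the definition, not code from the chapter: the function name `is_subsequence` and the encoding of an event as a string of single-letter items are assumptions made here. It matches each event of the pattern greedily to the earliest remaining event that contains it, which preserves order and keeps the mapping one-to-one.

```python
def is_subsequence(alpha, beta):
    """Return True if sequence alpha is a subsequence of sequence beta.

    Both arguments are lists of events; an event is any iterable of items
    (here: a string of single-letter items, an assumption for brevity).
    """
    pos = 0  # index of the next event of beta that may still be used
    for event in alpha:
        # Skip events of beta that do not contain the current event of alpha.
        while pos < len(beta) and not set(event) <= set(beta[pos]):
            pos += 1
        if pos == len(beta):
            return False  # no remaining event of beta contains this event
        pos += 1  # the next event of alpha must map to a strictly later event
    return True


print(is_subsequence(["B", "AC"], ["AB", "E", "ACD"]))  # the example from the text
print(is_subsequence(["AB", "E"], ["ABE"]))
```

Greedy earliest matching is sufficient here because choosing the earliest containing event never rules out a later match for the remaining events.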

The database D for sequence mining consists of a collection of input-sequences. 
Each input-sequence in the database has a unique identifier called sid, and each 
event in a given input-sequence also has a unique identifier called eid. We assume 
that no sequence has more than one event with the same time-stamp, so that 
we can use the time-stamp as the event identifier. 

An input-sequence $\mathcal{C}$ is said to contain another sequence $\alpha$ if $\alpha \preceq \mathcal{C}$, i.e., 
if $\alpha$ is a subsequence of the input-sequence $\mathcal{C}$. The support or frequency of a 
sequence, denoted $\sigma(\alpha, \mathcal{D})$, is the total number of input-sequences in the 
database $\mathcal{D}$ that contain $\alpha$. Given a user-specified threshold called the minimum 
support (denoted minsup), we say that a sequence is frequent if it occurs in at 
least minsup input-sequences. The set of frequent k-sequences is denoted as $\mathcal{F}_k$. A frequent 
sequence is maximal if it is not a subsequence of any other frequent sequence. 

Given a database $\mathcal{D}$ of input-sequences and minsup, the problem of mining 
sequential patterns is to find all frequent sequences in the database. For example, 
consider the input database shown in Figure 1. The database has eight items (A 
to H), four input-sequences, and ten events in all. The figure also shows all the 
frequent sequences with a minimum support of 50% (i.e., a sequence must occur 
in at least 2 input-sequences). In this example we have two maximal frequent 
sequences, $ABF$ and $D \to BF \to A$. 
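As an illustration, support can be counted by a direct scan of the input-sequences. The sketch below is not the chapter's method (SPADE avoids such scans); the names `DB`, `contains`, `support` and `is_frequent` are chosen here, and the data are transcribed from the recovered horizontal database of Figure 8, with each event written as a string of single-letter items.

```python
# Example database transcribed from Fig. 8: sid -> {eid: items of the event}.
DB = {
    1: {10: "CD", 15: "ABC", 20: "ABF", 25: "ACDF"},
    2: {15: "ABF", 20: "E"},
    3: {10: "ABF"},
    4: {10: "DGH", 20: "BF", 25: "AGH"},
}


def contains(input_seq, pattern):
    """True if the input-sequence contains `pattern` (a list of events)."""
    # One shared iterator: each pattern event consumes events up to its match,
    # so later pattern events can only match strictly later input events.
    events = iter([input_seq[eid] for eid in sorted(input_seq)])
    return all(any(set(p) <= set(e) for e in events) for p in pattern)


def support(db, pattern):
    """Number of input-sequences of db that contain the pattern."""
    return sum(contains(seq, pattern) for seq in db.values())


def is_frequent(db, pattern, minsup_fraction):
    """Frequent means support of at least minsup_fraction * |db|."""
    return support(db, pattern) >= minsup_fraction * len(db)


print(support(DB, ["ABF"]), support(DB, ["D", "BF", "A"]))
```

With a 50% threshold, both maximal sequences named in the text pass the `is_frequent` check on this data.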

Some comments are in order to see the generality of our problem formulation: 
1) We discover sequences of subsets of items, and not just single item sequences. 
For example, the set $BF$ in $(D \to BF \to A)$. 2) We discover sequences with 
arbitrary gaps among events, and not just the consecutive subsequences. For 
example, the sequence $(D \to BF \to A)$ is a subsequence of input-sequence 1, even 
though there is an intervening event between D and BF. The sequence symbol 
$\to$ simply denotes a happens-after relationship. 3) Our formulation is general 
enough to encompass almost any categorical sequential domain. For example, 
if the input-sequences are DNA strings, then an event consists of a single item 
(one of A, C, G, T). If input-sequences represent text documents, then each word 
(along with any other attributes of that word, e.g., noun, position, etc.) would 
comprise an event. Even continuous domains can be represented after a suitable 
discretization step. 

Fig. 1. Original Input-Sequence Database 

Once the frequent sequences are known, they can be used to obtain rules 
that describe the relationship between different sequence items. Let $\alpha$ and $\beta$ 
be two sequences. The confidence of a sequence rule $\alpha \Rightarrow \beta$ is the conditional 
probability that sequence $\beta$ occurs, given that $\alpha$ occurs in an input-sequence, 
given as 

$$\mathrm{Conf}(\alpha \Rightarrow \beta, \mathcal{D}) = \frac{\sigma(\alpha \to \beta, \mathcal{D})}{\sigma(\alpha, \mathcal{D})}$$

Given a user-specified threshold called the minimum confidence (denoted 
minconf), we say that a sequence rule is confident if $\mathrm{Conf}(\alpha \Rightarrow \beta, \mathcal{D}) \geq$ minconf. 
For example, the rule $(D \to BF) \Rightarrow (D \to BF \to A)$ has 100% confidence. 



3 Related Work 

The problem of mining sequential patterns was introduced in [2], whose authors also 
presented three algorithms for solving this problem. The AprioriAll algorithm 
was shown to perform better than the other two approaches. In subsequent 
work [12], the same authors proposed the GSP algorithm that outperformed 
AprioriAll by up to 20 times. They also introduced maximum gap, minimum 
gap, and sliding window constraints on the discovered sequences. 

We use GSP as a base against which we compare SPADE, as it is one of 
the best previous algorithms. GSP makes multiple passes over the database. 
In the first pass, all single items (1-sequences) are counted. From the frequent 
items a set of candidate 2-sequences is formed. Another pass is made to gather 
their support. The frequent 2-sequences are used to generate the candidate 
3-sequences. A pruning phase eliminates any sequence at least one of whose 
subsequences is not frequent. For fast counting, the candidate sequences are stored 
in a hash-tree. This iterative process is repeated until no more frequent sequences 
are found. For more details on the specific mechanisms for constructing and 
searching hash-trees, please refer to [12]. 

Independently, [10] proposed mining for frequent episodes, which are essentially 
frequent sequences in a single long input-sequence (typically, with single-item 
events, though they can handle set events). However, our formulation is geared 
towards finding frequent sequences across many different input-sequences. 
They further extended their framework in [9] to discover generalized episodes, 
which allow one to express arbitrary unary conditions on individual sequence 
events, or binary conditions on event pairs. The MEDD and MSDD 
algorithms [11] discover patterns in multiple event sequences; they explore the 
rule space directly instead of the sequence space. 

Sequence discovery bears similarity with association discovery [1,16,14]; it can 
be thought of as association mining over a temporal database. While association 
rules discover only intra-event patterns (called itemsets), we now also have to 
discover inter-event patterns (sequences). Further, the sequence search space 
is much more complex and challenging than the itemset space; the set of all 
frequent sequences is a superset of the set of frequent itemsets. 

4 Sequence Enumeration: Lattice-Based Approach 

Theorem 1. Given a set I of items, the ordered set S of all possible sequences 
on the items, induced by the subsequence relation $\preceq$, defines a hyper-lattice with 
the following two operations: the join, denoted $\vee$, of a set of sequences $A_i \in S$ 
is the set of minimal common supersequences, and the meet, denoted $\wedge$, of a set 
of sequences is the set of maximal common subsequences. More formally, 

Join: $\bigvee \{A_i\} = \{\alpha \mid A_i \preceq \alpha \text{ and } A_i \preceq \beta \text{ with } \beta \preceq \alpha \Rightarrow \beta = \alpha\}$ 

Meet: $\bigwedge \{A_i\} = \{\alpha \mid \alpha \preceq A_i \text{ and } \beta \preceq A_i \text{ with } \alpha \preceq \beta \Rightarrow \beta = \alpha\}$ 

Note that in a regular lattice the join and meet refer to the unique minimum 
upper bound and maximum lower bound. In a hyper-lattice the join and meet 
need not produce a unique element; instead the result can be a set of minimal 
upper bounds and maximal lower bounds. In the rest of this chapter we will 
usually refer to the sequence hyper-lattice as a lattice, since the sequence context 
is understood. 

Figure 2 shows the sequence lattice induced by the maximal frequent sequences 
$ABF$ and $D \to BF \to A$, for our example database. The bottom or least 
element, denoted $\bot$, of the lattice is $\bot = \{\}$, and the set of atoms (elements 
directly connected to the bottom element), denoted $\mathcal{A}$, is given by the frequent 
items $\mathcal{A} = \{A, B, D, F\}$. To see why the set of all sequences forms a hyper-lattice, 
consider the join of A and B; $A \vee B = \{(AB), (B \to A)\}$. As we can see the join 
produces two minimal upper bounds (i.e., minimal common super-sequences). 
Similarly, the meet of two (or more) sequences can produce a set of maximal 
lower bounds. For example, $(AB) \wedge (B \to A) = \{(A), (B)\}$, both of which are 
the maximal common sub-sequences. 

Fig. 2. Lattice Induced by Maximal Frequent Sequences $ABF$ and $D \to BF \to A$ 

In the abstract the sequence lattice can be potentially infinite, since we can 
have arbitrarily long sequences. Fortunately, in all practical cases not only is 
the lattice bounded (the longest sequence can have $C \cdot T$ items, where C is the 
maximum number of events per input-sequence and T is the maximum event 
size), but the set of frequent sequences is also very sparse (depending on the 
minsup value). For our example, we have C = 4 and T = 4, thus the longest 
sequence can have at most 16 items. 

The set of all frequent sequences is closed under the meet operation, i.e., 
if X and Y are frequent sequences, then the meet $X \wedge Y$ (maximal common 
subsequence) is also frequent. However, it is not closed under joins, since X and 
Y being frequent doesn’t imply that $X \vee Y$ (minimal common supersequence) 
is frequent. The closure under meet leads to the well known observation on 
sequence frequency: 

Lemma 1. All subsequences of a frequent sequence are frequent. 

What the lemma says is that we need to focus only on those sequences whose 
subsequences are frequent. This leads to a very powerful pruning strategy, where 
we eliminate all sequences, at least one of whose subsequences is infrequent. This 
property has been leveraged in many sequence mining algorithms [12,10,11]. 
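The pruning test implied by Lemma 1 can be sketched as follows. This is a hedged illustration rather than the chapter's implementation: the helper names `drop_one_item` and `can_be_frequent`, and the encoding of a sequence as a tuple of strings of single-letter items, are assumptions made here. A candidate k-sequence survives only if every (k-1)-subsequence obtained by deleting one item is already known to be frequent.

```python
def drop_one_item(seq):
    """Yield every (k-1)-subsequence obtained by deleting one item of seq.

    seq is a tuple of events; each event is a string of single-letter items.
    """
    for pos, event in enumerate(seq):
        for item in event:
            rest = event.replace(item, "", 1)
            # An event that becomes empty disappears from the sequence.
            middle = (rest,) if rest else ()
            yield seq[:pos] + middle + seq[pos + 1:]


def can_be_frequent(candidate, frequent):
    """Lemma 1 as a filter: all (k-1)-subsequences must be frequent."""
    return all(sub in frequent for sub in drop_one_item(candidate))


frequent_3 = {("BF", "A"), ("D", "F", "A"), ("D", "B", "A"), ("D", "BF")}
print(can_be_frequent(("D", "BF", "A"), frequent_3))
```

Passing this filter does not make a candidate frequent; it only means the candidate cannot be ruled out without counting its support.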

4.1 Support Counting 

Let’s associate with each atom X in the sequence lattice its id-list, denoted 
$\mathcal{L}(X)$, which is a list of all input-sequence (sid) and event identifier (eid) pairs 
containing the atom. Figure 3 shows the id-lists for the atoms in our example 
database. For example, consider the atom D. In our original database in Figure 1, 
we see that D occurs in the following input-sequence and event identifier pairs: 
{(1, 10), (1, 25), (4, 10)}. This forms the id-list for item D. 
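Building the id-lists from a horizontal database is a single pass over the events. The sketch below is an illustration only: the function name `to_vertical` and the dictionary layout are assumptions, and the data are transcribed from the recovered database of Figure 8.

```python
from collections import defaultdict

# Example database transcribed from Fig. 8: sid -> {eid: items of the event}.
DB = {
    1: {10: "CD", 15: "ABC", 20: "ABF", 25: "ACDF"},
    2: {15: "ABF", 20: "E"},
    3: {10: "ABF"},
    4: {10: "DGH", 20: "BF", 25: "AGH"},
}


def to_vertical(db):
    """Return item -> id-list, where an id-list is a list of (sid, eid) pairs."""
    idlists = defaultdict(list)
    for sid in sorted(db):
        for eid in sorted(db[sid]):
            for item in db[sid][eid]:
                idlists[item].append((sid, eid))
    return dict(idlists)


print(to_vertical(DB)["D"])
```

Visiting sids and eids in increasing order keeps every id-list sorted, which the joins described later rely on.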

Fig. 4. Naive Temporal Joins 



Lemma 2. For any $X \in S$, let $J = \{Y \in \mathcal{A}(S) \mid Y \preceq X\}$. Then $X = \bigvee_{Y \in J} Y$, 
and $\sigma(X) = |\bigcap_{Y \in J} \mathcal{L}(Y)|$, where $\bigcap$ denotes a temporal join of the id-lists, and 
$|\mathcal{L}(Z)|$, called the cardinality of $\mathcal{L}(Z)$, denotes the number of distinct sid values 
in the id-list for a sequence Z. 

The above lemma states that any sequence in S can be obtained as a temporal 
join of some atoms of the lattice, and the support of the sequence can be obtained 
by joining the id-lists of the atoms. Let’s say we wish to compute the support 
of sequence $(D \to BF \to A)$. Here the set $J = \{D, B, F, A\}$. We can perform 
temporal joins one atom at a time to obtain the final id-list, as shown in Figure 4. 
We start with the id-list for atom D and join it with that of B. Since the symbol 
$\to$ represents a temporal relationship, we find all occurrences of B after a D in 
an input-sequence, and store the corresponding time-stamps or eids, to obtain 
$\mathcal{L}(D \to B)$. We next join the id-list of $(D \to B)$ with that of atom F, but 
this time the relationship between B and F is a non-temporal one, which we 
call an equality join, since they must occur at the same time. We thus find all 
occurrences of B and F with the same eid and store them in the id-list for 
$(D \to BF)$. Finally, a temporal join with $\mathcal{L}(A)$ completes the process. 



Space-Efficient Joins If we naively produce the id-lists (as shown in Figure 4) 
by storing the eids (or time-stamps) for all items in a sequence, we waste too 
much space. Using the lemma below, which states that we can always generate 
a sequence by joining its lexicographically first two (k-1)-length subsequences, it 
is possible to reduce the space requirements, by storing only (sid,eid) pairs (i.e., 
only two columns) for any sequence, no matter how many items it has. 

Lemma 3. For any sequence $X \in S$, let $X_1$ and $X_2$ denote the lexicographically 
first two (k-1)-subsequences of X. Then $X = X_1 \vee X_2$ and 
$\sigma(X) = |\mathcal{L}(X_1) \cap \mathcal{L}(X_2)|$. 

This lemma allows space reduction because the first two 
(k-1)-length sequences, $X_1$ and $X_2$, of a sequence X share a (k-2)-length prefix. 
Since they share the same prefix, it follows that the eids for the items in the 
prefix must be the same, and the only difference between $X_1$ and $X_2$ is in the 
eids of their last items. Thus it suffices to discard all eids for the prefix, and to 
keep track of only the eids for the last item of a sequence. 

Figure 5 illustrates how the id-list for $(D \to BF \to A)$ can be obtained using 
the space-efficient id-list joins. Let $X = (D \to BF \to A)$, then we must perform 
a temporal join on its first two subsequences $X_1 = (D \to BF)$ (obtained by 
dropping the last item from X), and $X_2 = (D \to B \to A)$ (obtained by dropping 
the second to last item from X). Then, recursively, to obtain the id-list for 
$(D \to BF)$ we must perform an equality join on the id-lists of $(D \to B)$ and 
$(D \to F)$. For $(D \to B \to A)$ we must perform a temporal join on $\mathcal{L}(D \to B)$ 
and $\mathcal{L}(D \to A)$. Finally, the 2-sequences are obtained by joining the atoms 
directly. Figure 5 shows the complete process, starting with the initial vertical 
database of the id-list for each atom. As we can see, at each point only (sid,eid) 
pairs are stored in the id-lists (i.e., only the eid for the last item of a sequence 
is stored). The exact details of the temporal joins are provided in Section 5.3, 
when we discuss the implementation of SPADE. 
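The chain of joins just described can be traced in code. The following is a minimal sketch under stated assumptions, not the SPADE implementation: the function names `temporal_join` and `equality_join` are chosen here, and the id-lists of the frequent items are derived from the pairs listed in Figure 8. Every id-list keeps only (sid, eid) pairs, where eid is the time-stamp of the last item of the sequence.

```python
# Id-lists of the frequent items, derived from the (item, eid) pairs of Fig. 8.
L = {
    "A": [(1, 15), (1, 20), (1, 25), (2, 15), (3, 10), (4, 25)],
    "B": [(1, 15), (1, 20), (2, 15), (3, 10), (4, 20)],
    "D": [(1, 10), (1, 25), (4, 10)],
    "F": [(1, 20), (1, 25), (2, 15), (3, 10), (4, 20)],
}


def temporal_join(lx, ly):
    """Keep the pairs of ly that occur strictly after some pair of lx (same sid)."""
    earliest = {}
    for sid, eid in lx:
        earliest[sid] = min(eid, earliest.get(sid, eid))
    return [(s, e) for s, e in ly if s in earliest and earliest[s] < e]


def equality_join(lx, ly):
    """Keep the pairs that appear in both id-lists (same sid and same eid)."""
    common = set(ly)
    return [pair for pair in lx if pair in common]


# 2-sequences are obtained by joining the atoms directly.
d_b = temporal_join(L["D"], L["B"])      # D -> B
d_f = temporal_join(L["D"], L["F"])      # D -> F
d_a = temporal_join(L["D"], L["A"])      # D -> A
# 3-sequences: X1 = D -> BF (equality join), X2 = D -> B -> A (temporal join).
d_bf = equality_join(d_b, d_f)
d_b_a = temporal_join(d_b, d_a)
# Final 4-sequence D -> BF -> A and its support (number of distinct sids).
d_bf_a = temporal_join(d_bf, d_b_a)
support = len({sid for sid, _ in d_bf_a})
print(d_bf_a, support)
```

Comparing against the earliest eid per sid is enough for the temporal join, since a pair qualifies as soon as any occurrence of the first sequence precedes it.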

Lemma 4. Let X and Y be two sequences, with $X \preceq Y$. Then $|\mathcal{L}(X)| \geq |\mathcal{L}(Y)|$. 

This lemma says that if the sequence X is a subsequence of Y, then the 
cardinality of the id-list of Y (i.e., its support) must be equal to or less than the 
cardinality of the id-list of X. A practical and important consequence of this 
lemma is that the cardinalities of intermediate id-lists shrink as we move up the 
lattice. This results in very fast joins and support counting. 




Fig. 5. Computing Support via Space-Efficient Temporal Id-list Joins 



4.2 Lattice Decomposition: Prefix-Based Classes 

If we had enough main-memory, we could enumerate all the frequent sequences 
by traversing the lattice, and performing temporal joins to obtain sequence 
supports. In practice, however, we only have a limited amount of main-memory, and 
all the intermediate id-lists will not fit in memory. This brings up a natural 
question: can we decompose the original lattice into smaller pieces such that each 
piece can be solved independently in main-memory? We address this question 
below. 

Define a function $p : (S, N) \to S$ where S is the set of sequences, N is the set 
of non-negative integers, and $p(X, k) = X[1 : k]$. In other words, $p(X, k)$ returns 
the k length prefix of X. Define an equivalence relation $\theta_k$ on the lattice S as 
follows: $\forall X, Y \in S$, we say that X is related to Y under $\theta_k$, denoted as $X \equiv_{\theta_k} Y$, 
if and only if $p(X, k) = p(Y, k)$. That is, two sequences are in the same class if 
they share a common k length prefix. 
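The prefix function and the induced partition can be illustrated as follows. This is a sketch under assumptions made here: the names `prefix` and `partition` are hypothetical, a sequence is a tuple of events, and each event is a string of single-letter items in lexicographic order.

```python
from collections import defaultdict


def prefix(seq, k):
    """Return the k-item prefix of seq, keeping event boundaries."""
    out, remaining = [], k
    for event in seq:
        if remaining == 0:
            break
        out.append(event[:remaining])   # take at most `remaining` items
        remaining -= len(out[-1])
    return tuple(out)


def partition(sequences, k):
    """Group sequences into classes that share the same k-item prefix."""
    classes = defaultdict(list)
    for seq in sequences:
        classes[prefix(seq, k)].append(seq)
    return dict(classes)


seqs = [("A",), ("ABF",), ("D",), ("D", "A"), ("D", "BF"),
        ("D", "B", "A"), ("D", "BF", "A")]
level1 = partition(seqs, 1)             # classes [A] and [D]
level2 = partition(level1[("D",)], 2)   # recursive decomposition of [D]
print(sorted(level1), sorted(level2))
```

A sequence with fewer than k items is returned unchanged by `prefix`, so it forms a class of its own at that level.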

Figure 6 shows the partition induced by the equivalence relation $\theta_1$ on S, 
where we collapse all sequences with a common item prefix into an equivalence 
class. The resulting set of equivalence classes is {[A], [B], [D], [F]}. We call these 
first-level classes the parent classes. 

Fig. 6. a) Equivalence Classes of S Induced by $\theta_1$, b) Classes of $[D]_{\theta_1}$ Induced by $\theta_2$ 



Lemma 5. Each equivalence class $[X]_{\theta_k}$ induced by the equivalence relation $\theta_k$ 
is a sub-(hyper)lattice of S. 

Each $[X]_{\theta_1}$ is thus a hyper-lattice with its own set of atoms. For example, 
the atoms of $[D]_{\theta_1}$ are $\{D \to A, D \to B, D \to F\}$, and the bottom element is 
$\bot = D$. By the application of Lemma 3, we can generate the supports of all 
the sequences in each class (sub-lattice) using temporal joins. If there is enough 
main-memory to hold temporary id-lists for each class, then we can solve each 
$[X]_{\theta_1}$ independently. 

In practice we have found that the one level decomposition induced by $\theta_1$ 
is sufficient. However, in some cases, a class may still be too large to be solved 
in main-memory. In this scenario, we apply recursive class decomposition. Let’s 
assume that [D] is too large to fit in main-memory. Since [D] is itself a lattice, 
it can be decomposed using the relation $\theta_2$. Figure 6 shows the classes induced 
by applying $\theta_2$ on [D] (after applying $\theta_1$ on S). Each of the resulting six parent 
classes, [A], [B], $[D \to A]$, $[D \to B]$, $[D \to F]$, and [F], can be processed 
independently to generate frequent sequences from each class. Thus depending on 
the amount of main-memory available, we can recursively partition large classes 
into smaller ones, until each class is small enough to be solved independently in 
main-memory. 



5 SPADE: Implementation Issues 

In this section we describe the implementation of SPADE. Figure 7 shows the 
high level structure of the algorithm. The main steps include the computation of 
the frequent 1-sequences and 2-sequences, the decomposition into prefix-based 
parent equivalence classes, and the enumeration of all other frequent sequences 
via BFS or DFS search within each class. We will now describe each step in some 
more detail. 



SPADE (minsup, $\mathcal{D}$): 
    $\mathcal{F}_1$ = { frequent items or 1-sequences }; 
    $\mathcal{F}_2$ = { frequent 2-sequences }; 
    $\mathcal{E}$ = { equivalence classes $[X]_{\theta_1}$ }; 
    for all $[X] \in \mathcal{E}$ do Enumerate-Frequent-Seq($[X]$); 



Fig. 7. The SPADE Algorithm 



5.1 Computing Frequent 1-Sequences and 2-Sequences 

Most of the current sequence mining algorithms [2,12] assume a horizontal 
database layout such as the one shown in Figure 1. In the horizontal format the 
database consists of a set of input-sequences. Each input-sequence has a set of 
events, along with the items contained in the event. In contrast our algorithm 
uses a vertical database format, where we maintain a disk-based id-list for each 
item, as shown in Figure 3. Each entry of the id-list is a (sid, eid) pair where 
the item occurs. This enables us to check support via simple id-list joins. 

Computing $\mathcal{F}_1$: Given the vertical id-list database, all frequent 1-sequences can 
be computed in a single database scan. For each database item, we read its id-list 
from the disk into memory. We then scan the id-list, incrementing the support 
for each new sid encountered. 
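The counting step for the frequent 1-sequences can be sketched as below. The function name `frequent_items` and the in-memory dictionary standing in for the disk-based id-lists are assumptions made for illustration; the id-lists are derived from the (item, eid) pairs of Figure 8 and are sorted by sid.

```python
# Id-lists for all eight items, derived from the pairs listed in Fig. 8.
IDLISTS = {
    "A": [(1, 15), (1, 20), (1, 25), (2, 15), (3, 10), (4, 25)],
    "B": [(1, 15), (1, 20), (2, 15), (3, 10), (4, 20)],
    "C": [(1, 10), (1, 15), (1, 25)],
    "D": [(1, 10), (1, 25), (4, 10)],
    "E": [(2, 20)],
    "F": [(1, 20), (1, 25), (2, 15), (3, 10), (4, 20)],
    "G": [(4, 10), (4, 25)],
    "H": [(4, 10), (4, 25)],
}


def frequent_items(idlists, minsup):
    """Return item -> support for every item with support >= minsup."""
    frequent = {}
    for item, idlist in idlists.items():
        count, last_sid = 0, None
        for sid, _eid in idlist:
            if sid != last_sid:      # a new sid: one more supporting sequence
                count += 1
                last_sid = sid
        if count >= minsup:
            frequent[item] = count
    return frequent


print(frequent_items(IDLISTS, 2))
```

Because each id-list is sorted by sid, comparing with the previous sid is enough to count distinct input-sequences in one scan.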



sid  (item, eid) pairs 
1    (A 15) (A 20) (A 25) (B 15) (B 20) (C 10) (C 15) (C 25) (D 10) (D 25) (F 20) (F 25) 
2    (A 15) (B 15) (E 20) (F 15) 
3    (A 10) (B 10) (F 10) 
4    (A 25) (B 20) (D 10) (F 20) (G 10) (G 25) (H 10) (H 25) 

Fig. 8. Vertical-to-Horizontal Database Recovery 



Computing $\mathcal{F}_2$: Let $N = |\mathcal{F}_1|$ be the number of frequent items, and A the 
average id-list size in bytes. A naive implementation for computing the frequent 
2-sequences requires $\binom{N}{2}$ id-list joins for all pairs of items. The amount of data 
read is $A \cdot N \cdot (N - 1)/2$, which corresponds to around N/2 data scans. This is 
clearly inefficient. Instead of the naive method we propose two alternate 
solutions: 

1. Use a preprocessing step to gather the counts of all 2-sequences above a 
user specified lower bound. Since this information is invariant, it has to be 
computed once, and the cost can be amortized over the number of times the 
data is mined. 

2. Perform a vertical-to-horizontal transformation on-the-fly. This can be done 
quite easily, with very little overhead. For each item i, we scan its id-list 
into memory. For each (sid, eid) pair, say (s, e) in $\mathcal{L}(i)$, we insert (i, e) in the 
list for input-sequence s. For example, consider the id-list for item A, shown 
in Figure 3. We scan the first pair (1, 15), and then insert (A, 15) in the 
list for input-sequence 1. Figure 8 shows the complete horizontal database 
recovered from the vertical item id-lists. Computing $\mathcal{F}_2$ from the recovered 
horizontal database is straightforward. We form a list of all 2-sequences in 
the list for each sid, and update counts in a 2-dimensional array indexed by 
the frequent items. 
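The second solution can be sketched as follows. This is an illustration under assumptions made here, not the chapter's code: the names `to_horizontal` and `count_2_sequences` are hypothetical, a `Counter` stands in for the 2-dimensional array, and the id-lists of the frequent items are derived from Figure 8. A 2-sequence is encoded as `("AB",)` for the event AB and as `("A", "B")` for the sequence A followed later by B.

```python
from collections import Counter, defaultdict

# Id-lists of the frequent items, derived from the pairs listed in Fig. 8.
L = {
    "A": [(1, 15), (1, 20), (1, 25), (2, 15), (3, 10), (4, 25)],
    "B": [(1, 15), (1, 20), (2, 15), (3, 10), (4, 20)],
    "D": [(1, 10), (1, 25), (4, 10)],
    "F": [(1, 20), (1, 25), (2, 15), (3, 10), (4, 20)],
}


def to_horizontal(idlists):
    """Invert item -> [(sid, eid)] into sid -> [(item, eid)]."""
    horizontal = defaultdict(list)
    for item in sorted(idlists):
        for sid, eid in idlists[item]:
            horizontal[sid].append((item, eid))
    return dict(horizontal)


def count_2_sequences(horizontal):
    """Count, per 2-sequence, the number of input-sequences containing it."""
    counts = Counter()
    for pairs in horizontal.values():
        seen = set()                      # count each 2-sequence once per sid
        for x, ex in pairs:
            for y, ey in pairs:
                if ex < ey:
                    seen.add((x, y))      # sequence: x happens before y
                elif ex == ey and x < y:
                    seen.add((x + y,))    # event: x and y at the same eid
        counts.update(seen)
    return counts


counts = count_2_sequences(to_horizontal(L))
frequent_2 = {seq for seq, c in counts.items() if c >= 2}
print(sorted(frequent_2))
```

Restricting the inversion to the frequent items keeps the recovered lists short, since by Lemma 1 no frequent 2-sequence can contain an infrequent item.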

5.2 Enumerating Frequent Sequences of a Class 

Figure 9 shows the pseudo-code for the breadth-first and depth-first search. The 
input to the procedure is a set of atoms of a sub-lattice S, along with their 
id-lists. Frequent sequences are generated by joining the id-lists of all pairs of 
atoms (including a self-join) and checking the cardinality of the resulting id-list 
against minsup. 



Enumerate-Frequent-Seq(S): 
    for all atoms $A_i \in S$ do 
        $T_i = \emptyset$; 
        for all atoms $A_j \in S$, with $j \geq i$ do 
            $R = A_i \vee A_j$; 
            $\mathcal{L}(R) = \mathcal{L}(A_i) \cap \mathcal{L}(A_j)$; 
            if $\sigma(R) \geq$ minsup then 
                $T_i = T_i \cup \{R\}$; $\mathcal{F}_{|R|} = \mathcal{F}_{|R|} \cup \{R\}$; 
        end 
        if (Depth-First-Search) then Enumerate-Frequent-Seq($T_i$); 
    end 
    if (Breadth-First-Search) then 
        for all $T_i \neq \emptyset$ do Enumerate-Frequent-Seq($T_i$); 



Fig. 9. Pseudo-code for Breadth-First and Depth-First Search 



SPADE supports both breadth-first (BFS) and depth-first (DFS) search. In 
BFS we process all the child classes at a level before moving on to the next level, 
while in DFS we completely solve all child equivalence classes along one path 
before moving on to the next path. DFS also requires less main-memory than 
BFS. DFS needs only to keep the intermediate id-lists for two consecutive classes 
along a single path, while BFS must keep track of id-lists for all the classes in 
two consecutive levels. Consequently, when the number of frequent sequences is 
very large, for example in dense domains or in cases where the minsup value 
is very low, DFS may be the only feasible approach, since BFS can run out of 
virtual memory. 

The sequences found to be frequent at the current level form the atoms of 
classes for the next level. This recursive process is repeated until all frequent 
sequences have been enumerated. In terms of memory management it is easy to 
see that we need memory to store intermediate id-lists for at most two 
consecutive levels. The depth-first search requires memory for two classes on the two 
levels. The breadth-first search requires memory for all the classes on the two 
levels. Once all the frequent sequences for the next level have been generated, 
the sequences at the current level can be deleted. 
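The depth-first variant of the enumeration can be sketched in Python. This is a simplified illustration, not the authors' implementation: it treats the frequent items as the atoms of a single class with an empty prefix instead of computing the 2-sequences separately, it keeps everything in memory, and all names (`spade_dfs`, `enumerate_frequent`, and the join helpers) are chosen here. Sequences are tuples of events, events are strings of single-letter items, and the id-lists are derived from Figure 8.

```python
IDLISTS = {
    "A": [(1, 15), (1, 20), (1, 25), (2, 15), (3, 10), (4, 25)],
    "B": [(1, 15), (1, 20), (2, 15), (3, 10), (4, 20)],
    "C": [(1, 10), (1, 15), (1, 25)],
    "D": [(1, 10), (1, 25), (4, 10)],
    "E": [(2, 20)],
    "F": [(1, 20), (1, 25), (2, 15), (3, 10), (4, 20)],
    "G": [(4, 10), (4, 25)],
    "H": [(4, 10), (4, 25)],
}


def support(idlist):
    """Number of distinct sids in an id-list of (sid, eid) pairs."""
    return len({sid for sid, _ in idlist})


def temporal_join(lx, ly):
    """Pairs of ly that occur strictly after some pair of lx in the same sid."""
    earliest = {}
    for sid, eid in lx:
        earliest[sid] = min(eid, earliest.get(sid, eid))
    return [(s, e) for s, e in ly if s in earliest and earliest[s] < e]


def equality_join(lx, ly):
    """Pairs present in both id-lists (same sid and same eid)."""
    common = set(ly)
    return [pair for pair in lx if pair in common]


def enumerate_frequent(atoms, minsup, found):
    """Depth-first search; atoms are (sequence, id-list) pairs of one class."""
    for seq_x, lx in atoms:
        found[seq_x] = support(lx)
        x, x_event = seq_x[-1][-1], len(seq_x[-1]) > 1
        children = []                     # atoms of the class with prefix seq_x
        for seq_y, ly in atoms:
            y, y_event = seq_y[-1][-1], len(seq_y[-1]) > 1
            candidates = []
            if x_event == y_event and y > x:
                # y joins the last event of seq_x (equality join).
                candidates.append((seq_x[:-1] + (seq_x[-1] + y,),
                                   equality_join(lx, ly)))
            if not y_event:
                # y starts a new event after seq_x (temporal join).
                candidates.append((seq_x + (y,), temporal_join(lx, ly)))
            children += [c for c in candidates if support(c[1]) >= minsup]
        enumerate_frequent(children, minsup, found)
    return found


def spade_dfs(idlists, minsup):
    atoms = [((item,), idlists[item]) for item in sorted(idlists)
             if support(idlists[item]) >= minsup]
    return enumerate_frequent(atoms, minsup, {})


found = spade_dfs(IDLISTS, 2)
print(len(found), found[("D", "BF", "A")])
```

Each atom generates only the candidates that extend it, so every candidate lands in the class of its own prefix; this differs from the pairwise loop of Figure 9 in bookkeeping, not in the set of sequences examined.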



5.3 Temporal Id-List Join 

We now describe how we perform the id-list joins for two sequences. Consider 
an equivalence class $[B \to A]$ with the atom set $\{B \to AB, B \to AD, B \to A \to A, 
B \to A \to D, B \to A \to F\}$. If we let P stand for the prefix $B \to A$, then we 
can rewrite the class to get $[P] = \{PB, PD, P \to A, P \to D, P \to F\}$. One can 
observe the class has two kinds of atoms: the event atoms $\{PB, PD\}$, and the 
sequence atoms $\{P \to A, P \to D, P \to F\}$. We assume without loss of generality 
that the event atoms of a class always precede the sequence atoms. To extend the 
class it is sufficient to join the id-lists of all pairs of atoms. However, depending on 
the atom pairs being joined, there can be up to three possible resulting frequent 
sequences (these are the three possible minimal common super-sequences): 

1. Event Atom with Event Atom: If we are joining $PB$ with $PD$, then the 
only possible outcome is the new event atom $PBD$. 

2. Event Atom with Sequence Atom: If we are joining $PB$ with $P \to A$, 
then the only possible outcome is the new sequence atom $PB \to A$. 

3. Sequence Atom with Sequence Atom: If we are joining $P \to A$ with 
$P \to F$, then there are three possible outcomes: a new event atom $P \to AF$, 
and two new sequence atoms $P \to A \to F$ and $P \to F \to A$. A special 
case arises when we join $P \to A$ with itself, which can produce only the new 
sequence atom $P \to A \to A$. 
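The three cases above can be written as a small candidate generator. The sketch is illustrative only: the function names `is_event_atom` and `join_candidates` are assumptions, a sequence is a tuple of events, each event is a string of single-letter items in lexicographic order, and both arguments are assumed to be atoms of the same class. It produces the candidate sequences only; their id-lists and supports are a separate step.

```python
def is_event_atom(seq):
    """An event atom adds its last item to the last event of the prefix."""
    return len(seq[-1]) > 1


def join_candidates(x, y):
    """Return the candidate sequences produced by joining two atoms of a class."""
    ix, iy = x[-1][-1], y[-1][-1]            # the items that differ
    ex, ey = is_event_atom(x), is_event_atom(y)
    if ex and ey:
        # Case 1: event atom with event atom, e.g. PB and PD give PBD.
        if ix == iy:
            return []
        base, hi = (x, iy) if ix < iy else (y, ix)
        return [base[:-1] + (base[-1] + hi,)]
    if ex != ey:
        # Case 2: event atom with sequence atom, e.g. PB and P->A give PB->A.
        event_atom, seq_atom = (x, y) if ex else (y, x)
        return [event_atom + (seq_atom[-1],)]
    if x == y:
        # Special case of case 3: self-join, e.g. P->A gives P->A->A.
        return [x + (ix,)]
    # Case 3: sequence atom with sequence atom gives three candidates.
    lo, hi = sorted((ix, iy))
    return [x[:-1] + (lo + hi,), x + (iy,), y + (ix,)]


# The class [P] with P = B -> A from the text.
PB, PD = ("B", "AB"), ("B", "AD")
PA, PF = ("B", "A", "A"), ("B", "A", "F")
print(join_candidates(PA, PF))
```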

We now describe how the actual id-list join is performed. Consider Figure 10, 
which shows the hypothetical id-lists for the sequence atoms $P \to A$ and $P \to F$. 
To compute the new id-list for the resulting event atom $P \to AF$, we simply need 
to check for equality of (sid,eid) pairs. In our example, the only matching pairs 
are {(8, 30), (8, 50), (8, 80)}. This forms the id-list for $P \to AF$. To compute the 
id-list for the new sequence atom $P \to A \to F$, we need to check for a temporal 
relationship, i.e., for a given pair $(s, t_1)$ in $\mathcal{L}(P \to A)$, we check whether there 
exists a pair $(s, t_2)$ in $\mathcal{L}(P \to F)$ with the same sid s, but with $t_2 > t_1$. If this is 
true, it means that the item F follows the item A for input-sequence s. In other 
words, the input-sequence s contains the pattern $P \to A \to F$, and the pair 
$(s, t_2)$ is added to the pattern’s id-list. Finally, the id-list for $P \to F \to A$ can be 
obtained in a similar manner by reversing the roles of $P \to A$ and $P \to F$. The 
final id-lists for the three new sequences are shown in Figure 10. Since we join 
only sequences within a class, which have the same prefix (whose items have the 
same eid or time-stamp), we need only to keep track of the last item’s eid for 
determining the equality and temporal relationships. As a further optimization, 
we generate the id-lists of all the three possible new sequences in just one join. 

Fig. 10. Temporal Id-list Join 

6 Experimental Results 

In this section we study the performance of SPADE by varying different database 
parameters and by comparing it with the GSP algorithm. GSP was implemented 
as described in [12]. For SPADE results are shown only for the BFS search. 
Experiments were performed on a 100MHz MIPS processor with 256MB main 
memory running IRIX 6.2. The data was stored on a non-local 2GB disk. 

Synthetic Datasets The synthetic datasets are the same as those used in [12], 
albeit with twice as many input-sequences. We used the publicly available 
dataset generation code from the IBM Quest data mining project [5]. These datasets 
mimic real-world transactions, where people buy a sequence of sets of items. 
Some customers may buy only some items from the sequences, or they may buy 
items from multiple sequences. The input-sequence size and event size are 
clustered around a mean and a few of them may have many elements. The datasets 
are generated using the following process. First $N_I$ maximal events of average 
size I are generated by choosing from N items. Then $N_S$ maximal sequences of 
average size S are created by assigning events from $N_I$ to each sequence. Next 
a customer (or input-sequence) of average C transactions (or events) is created, 
and sequences in $N_S$ are assigned to different customer elements, respecting the 
average transaction size of T. The generation stops when D input-sequences 
have been generated. Like [12] we set $N_S = 5000$, $N_I = 25000$ and $N = 10000$. 
Figure 11 shows the datasets with their parameter settings. We refer the reader 
to [2] for additional details on the dataset generation. 

Dataset                            C    T    S   I     D        Size (MB) 
C10-T2.5-S4-I1.25-D(100K-1000K)    10   2.5  4   1.25  100,000  18.4-184.0 
C10-T5-S4-I2.5-D200K               10   5    4   2.5   200,000  54.3 
C20-T2.5-S4-I2.5-D200K             20   2.5  4   2.5   200,000  66.5 
C20-T2.5-S8-I1.25-D200K            20   2.5  8   1.25  200,000  76.4 

Fig. 11. Synthetic Datasets 



PLAN DATABASE (empty cells are shown as "-") 

PlanId  Time  EventId  Action  Outcome    Route                  From     To        AtLocation  Cargo    Vehicle     VehicleId  Weather 
1       10    78       Move    Success    Delta-Exodus           Delta    Exodus    -           -        Helicopter  Heli1      Good 
1       20    84       Load    Success    -                      -        -         Exodus      People7  -           Heli1      - 
1       30    85       Move    Flat       Exodus-Barnacle-Abyss  Exodus   Barnacle  -           -        Helicopter  Heli1      Fair 
1       40    101      Unload  Crash      -                      -        -         Barnacle    People7  Helicopter  Heli1      Hazardous 
2       10    7        Move    Flat       Delta-Calypso-Delta    Delta    Calypso   -           -        Truck       Truck1     Good 
2       20    10       Move    Breakdown  Delta-Calypso-Delta    Calypso  Delta     -           -        Truck       Truck1     Good 

Fig. 12. Example Plan Database 



Plan Dataset This real dataset was obtained from a planning domain. The input
consists of a database of plans for evacuating people from one city to another.
Each plan has a unique identifier, and a sequence of actions or events. Each event
is composed of several different attributes including the event time, the unique
event identifier, the action name, the outcome of the event, and a set of additional
parameters specifying the weather condition, vehicle type, origin and destination
city, cargo type, etc. Some example plans are shown in Figure 12. Each plan
represents an input-sequence (with sid = Planid). Each distinct attribute and
value pair is an item. For example, Action=Move, Action=Load, etc., are all
distinct items. A set of items forms an event (with eid = Time). For example,
the second row of the first plan corresponds to the event (84, Load, Success,
Exodus, People7, Heli1).
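
The mapping from plan rows to input-sequences can be sketched in a few lines of
Python. This is only an illustration: the dictionary layout, the helper name
to_input_sequences, and the assignment of Planid 1 to the first row are
assumptions made for the example, not part of the original system.

```python
from collections import defaultdict

# Two rows modelled on the first plan of the example plan database.
# Empty attributes are simply left out (an assumption of this sketch).
rows = [
    {"Planid": 1, "Time": 10, "EventId": 78, "Action": "Move",
     "Outcome": "Success", "Route": "Delta-Exodus", "From": "Delta",
     "To": "Exodus", "Vehicle": "Helicopter", "VehicleId": "Heli1",
     "Weather": "Good"},
    {"Planid": 1, "Time": 20, "EventId": 84, "Action": "Load",
     "Outcome": "Success", "AtLocation": "Exodus", "Cargo": "People7",
     "VehicleId": "Heli1"},
]


def to_input_sequences(rows):
    """Group rows by plan (sid = Planid) and order events by time (eid = Time)."""
    sequences = defaultdict(list)
    for row in rows:
        # Every attribute=value pair other than the two keys becomes an item.
        items = frozenset(
            f"{attr}={value}" for attr, value in row.items()
            if attr not in ("Planid", "Time")
        )
        sequences[row["Planid"]].append((row["Time"], items))
    # Sort each plan's events by their event identifier (the time stamp).
    return {sid: sorted(events, key=lambda event: event[0])
            for sid, events in sequences.items()}


sequences = to_input_sequences(rows)
```

Each plan then maps to a time-ordered list of (eid, item set) pairs, which is
the form a sequence miner would consume.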

The data mining goal is to identify the causes of plan failures. Each plan is 
tagged Failure or Success depending on whether or not it achieved its goal. We 
mine only the dataset of bad plans, which has 77 items, 202071 plans (input- 
sequences), and 829236 events in all. The average plan length is 4.1, and the 
average event length is 7.6. 



6.1 Comparison of SPADE with GSP 

Figure 13 compares SPADE with GSP on different synthetic datasets and the plan
dataset. Each graph shows the results as the minimum support is changed from
1% to 0.25%. Two sets of experiments are reported for each value of support.
The bar labeled SPADE corresponds to the case where we computed $F_2$ via
the vertical-to-horizontal transformation method described in Section 5.1. The
times for GSP and SPADE include the cost of computing $F_2$. The bars labeled
SPADE-F2 and GSP-F2 correspond to the case where $F_2$ was computed in a
pre-processing step, and the times shown don’t include the pre-processing cost.

The figures clearly indicate that the performance gap between the two
algorithms increases with decreasing minimum support. SPADE is about twice as
fast as GSP at lower values of support. In addition we see that SPADE-F2
outperforms GSP-F2 by an order of magnitude in most cases. Another conclusion
that can be drawn from the SPADE-F2 and GSP-F2 comparison is that nearly
all the benefit of SPADE comes from the improvement in the running time after
the $F_2$ pass, since both algorithms spend roughly the same time in computing
$F_2$. In the passes after $F_2$, SPADE outperforms GSP anywhere from a factor of
three to an order of magnitude.



6.2 Scaleup 

We study how SPADE performs with increasing number of input-sequences.
Figure 14 shows how SPADE scales up as the number of input-sequences is
increased tenfold, from 0.1 million to 1 million (the number of events is increased
from 1 million to 10 million, respectively). All the experiments were performed on
the C10-T2.5-S4-I1.25 dataset with different minimum support levels ranging
from 0.5% to 0.1%. The execution times are normalized with respect to the
time for the 0.1 million input-sequence dataset. It can be observed that SPADE
scales almost linearly. SPADE also scales linearly in the number of events per
input-sequence, event size and the size of potential maximal frequent events and
sequences [13].

[Panel: C20-T2.5-S4-I2.5-D200K]

Fig. 13. Performance Comparison: Synthetic and Plan Datasets

Fig. 14. Scale-up: Number of Input-Sequences

7 Application I: Predicting Plan Failures

We saw in the last section that SPADE is an efficient and scalable method 
for mining frequent sequences. However, the mining process rarely ends at this 
stage. The more important aspect is how to take the results of mining and use 
them effectively within the target domain. In this section we briefly describe our 
experiences in applying sequence mining in a planning domain to predict failures 
before they happen, and to improve the plans. 

Using SPADE to find the frequent sequences, we developed a system called
PlanMine [15], which has been integrated into two applications in planning:
the IMPROVE algorithm for improving large, probabilistic plans [6], and plan
monitoring.

IMPROVE automatically modifies a given plan so that it has a higher proba- 
bility of achieving its goal. IMPROVE runs PlanMine on the execution traces 
of the given plan to pinpoint defects in the plan that most often lead to plan 
failure. It then applies qualitative reasoning and plan adaptation algorithms to 
modify the plan to correct the defects detected by PlanMine. 

We applied SPADE to the planning dataset to detect sequences leading to 
plan failures. We found that since this domain has a complicated structure with 
redundancy in the data, SPADE generates an enormous number of highly fre- 
quent, but unpredictive rules [15]. Figure 15 shows the number of mined frequent 
sequences of different lengths for various levels of minimum support when we ran 
SPADE on the bad plans. At 60% support level we found an overwhelming num- 
ber of patterns (around 6.5 million). Even at 75% support, we have too many 
patterns (38386), most of which are quite useless for predicting failures when 
we compute their confidence relative to the entire database of plans. Clearly, 
all potentially useful patterns are present in the sequences mined from the bad 
plans; we must somehow extract the interesting ones from this set. 

We developed a three-step pruning strategy for selecting only the most pre- 
dictive sequences from the mined set: 

1. Pruning Normative Patterns: We eliminate all normative rules that are con- 
sistent with background knowledge that corresponds to the normal operation 
of a (good) plan, i.e., we eliminate those patterns that not only occur in bad 
plans, but also occur in the good plans quite often, since these patterns are 
not likely to be predictive of bad events. 

2. Pruning Redundant Patterns: We eliminate all redundant patterns that have
the same frequency as at least one of their proper subsequences, i.e., we
eliminate those patterns q that are obtained by augmenting an existing pattern
p, while q has the same frequency as p. The intuition is that p is as predictive
as q.

3. Pruning Dominated Patterns: We eliminate all dominated sequences that are
less predictive than any of their proper subsequences, i.e., we eliminate those
patterns q that are obtained by augmenting an existing pattern p, where p
is shorter or more general than q, and has a higher confidence of predicting
failure than q.





Fig. 15. a) Number of Frequent Sequences; b) Effect of Different Pruning Techniques 



Figure 15 shows the reduction in the number of frequent sequences after
applying each kind of pruning. After normative pruning (by removing patterns with
more than 25% support in good plans), we get more than a factor of 2
reduction (from 38386 to 17492 sequences). Applying redundant pruning in addition
to normative pruning reduces the pattern set from 17492 down to 113. Finally,
dominated pruning, when applied along with normative and redundant pruning,
reduces the rule set from 113 down to only 5 highly predictive patterns. The
combined effect of the three pruning techniques is to retain only the patterns
that have the highest confidence of predicting a failure, where confidence is given
as:



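$$\mathrm{Conf}(\alpha) = \frac{\sigma(\alpha, \mathcal{D}_b)}{\sigma(\alpha, \mathcal{D}_b + \mathcal{D}_g)}$$

where $\mathcal{D}_b$ is the dataset of bad plans and $\mathcal{D}_g$ the dataset of good plans.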
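
As a rough illustration of the confidence measure and the three pruning steps,
the following Python sketch operates on a small made-up table of support
counts. The pattern encoding (tuples of item sets), the function names, the
example counts, and the choice to treat two patterns as equally frequent only
when both their bad-plan and good-plan counts agree are all assumptions made
for the example; only the 25% normative threshold comes from the text.

```python
def seq(*events):
    """Build a pattern: a tuple of events, each a frozenset of items."""
    return tuple(frozenset(e) for e in events)


def is_subsequence(p, q):
    """True if every event of p is contained, in order, in some event of q."""
    pos = 0
    for event in q:
        if pos < len(p) and p[pos] <= event:
            pos += 1
    return pos == len(p)


def confidence(bad, good):
    """Support in bad plans divided by support in bad and good plans together."""
    return bad / (bad + good)


def prune(patterns, n_good, max_good_support=0.25):
    """patterns maps pattern -> (count in bad plans, count in good plans)."""
    def proper_subs(q, pool):
        return [p for p in pool if p != q and is_subsequence(p, q)]

    # 1. Normative pruning: drop patterns that are also common in good plans.
    kept = {p: c for p, c in patterns.items()
            if c[1] / n_good <= max_good_support}
    # 2. Redundant pruning: drop q if a proper subsequence has the same counts.
    kept = {q: c for q, c in kept.items()
            if not any(kept[p] == c for p in proper_subs(q, kept))}
    # 3. Dominated pruning: drop q if a proper subsequence has higher confidence.
    kept = {q: c for q, c in kept.items()
            if not any(confidence(*kept[p]) > confidence(*c)
                       for p in proper_subs(q, kept))}
    return kept


# Hypothetical support counts (bad plans, good plans) with 100 good plans.
patterns = {
    seq("A"): (80, 10),
    seq("B"): (90, 60),        # frequent in good plans too: normative
    seq("A", "B"): (80, 10),   # same counts as A: redundant
    seq("A", "C"): (40, 8),    # lower confidence than A: dominated
    seq("C"): (50, 20),
    seq("C", "D"): (30, 0),
}
survivors = prune(patterns, n_good=100)
```

On this toy input only A, C, and C followed by D survive; the last one has
confidence 1.0 because it never occurs in a good plan.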

These three steps are carried out automatically by mining the good and bad 
plans separately and comparing the discovered rules from the unsuccessful plans 
against those from the successful plans. There are two main goals: 1) to improve 
an existing plan, and 2) to generate a plan monitor for raising alarms. In the 
first case the planner generates a plan and simulates it multiple times. It then 
produces a database of good and bad plans in simulation. This information is 
fed into the mining engine, which discovers high frequency patterns in the bad 
plans. We next apply our pruning techniques to generate a final set of rules 
that are highly predictive of plan failure. This mined information is used for 
fixing the plan to prevent failures, and the loop is executed multiple times till no
further improvement is obtained. The planner then generates the final plan. For
the second goal, the planner generates multiple plans, and creates a database of 
good and bad plans (there is no simulation step). The high confidence patterns 
are mined as before, and the information is used to generate a plan monitor that 
raises alarms prior to failures in new plans. 

7.1 Experiments 

Plan Improvement We first discuss the role of PlanMine in IMPROVE, a fully 
automatic algorithm which modifies a given plan to increase its probability of 
goal satisfaction [6] . Table 1 shows the performance of the IMPROVE algorithm 
on a large evacuation domain that contains 35 cities, 45 roads, and 100 people. 
We use a domain-specific greedy scheduling algorithm to generate initial plans 
for this domain. The initial plans contain over 250 steps. 



Table 1. Performance of IMPROVE (averaged over 70 trials).

          initial      final        initial       final         num. plans
          plan length  plan length  success rate  success rate  tested
IMPROVE   272.3        278.9        0.82          0.98          11.7
RANDOM    272.3        287.4        0.82          0.85          23.4
HIGH      272.6        287.0        0.82          0.83          23.0



We compared IMPROVE with two less sophisticated alternatives. The RANDOM
approach modifies the plan randomly five times in each iteration, and
chooses the modification that works best in simulation. The HIGH approach
replaces the PlanMine component of IMPROVE with a technique that simply
tries to prevent the malfunctions that occur most often. As shown in Table 1,
PlanMine improves the plan success rate from 82% to 98%, while the less
sophisticated methods for choosing which part of the plan to repair were only
able to achieve a maximum of 85% success rate.

Plan Monitoring Figure 16a shows the evaluation of the monitors produced with 
PlanMine on a test set of 500 (novel) plans. The results are the averages over 
105 trials, and thus each number reflects an average of approximately 50,000 
separate tests. Note that precision is the ratio of correct failure signals to the 
total number of failure signals, while recall is the percentage of failures identi- 
fied. The figure clearly shows that our mining and pruning techniques produce 
excellent monitors, which have 100% precision with recall greater than 90%. We 
can produce monitors with significantly higher recall, but only by reducing pre- 
cision to around 50%. The desired tradeoff depends on the application. If plan 
failures are very costly then it might be worth sacrificing precision for recall. For
comparison we also built monitors that signaled failure as soon as a fixed
ber of malfunctions of any kind occurred. Figure 16b shows that this approach 
produces poor monitors, since there was no correlation between the number of 
malfunctions and the chance of failure (precision). 
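
The two measures used above can be computed as in the following minimal Python
sketch. The function name, the boolean encoding of alarms and failures, and the
convention of returning 1.0 when a denominator is zero are assumptions of the
example.

```python
def precision_recall(predicted, actual):
    """predicted/actual: parallel lists of booleans, True meaning failure."""
    correct_signals = sum(1 for p, a in zip(predicted, actual) if p and a)
    signals = sum(1 for p in predicted if p)   # all failure signals raised
    failures = sum(1 for a in actual if a)     # all plans that really failed
    precision = correct_signals / signals if signals else 1.0
    recall = correct_signals / failures if failures else 1.0
    return precision, recall


# Hypothetical monitor output for five plans.
predicted = [True, True, False, False, True]
actual = [True, True, True, False, False]
precision, recall = precision_recall(predicted, actual)
```

Here two of the three alarms are correct and two of the three failures are
caught, so both values come out at two thirds.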




[Plots of Precision, Recall, and Frequency; x-axis: Failure Count]

Fig. 16. a) Using PlanMine for Prediction; b) Using Failure Count for Prediction



8 Application II: Feature Selection 



Our next application of sequence mining is for feature selection. Many real world 
datasets contain irrelevant or redundant attributes. This may be because the 
data was collected without data mining in mind, or because the attribute de- 
pendences were not known a priori during data collection. It is well known that 
many data mining methods like classification, clustering, etc., degrade prediction 
accuracy when trained on datasets containing redundant or irrelevant attributes 
or features. Selecting the right feature set can not only improve accuracy, but 
can also reduce the running time of the predictive algorithms, and can lead to 
simpler, more understandable models. Good feature selection is thus one of the 
fundamental data preprocessing steps in data mining. 

Most research on feature selection to date has focused on non-sequential
domains. Here the problem may be defined as that of selecting an optimal feature
subset of size $l$ from the full $m$-dimensional feature space, where ideally $l \ll m$.
The selected subset should maximize some optimization criterion such as
classification accuracy, or it should faithfully capture the original data distribution.

Selecting the right features in sequential domains is even more challenging
than in non-sequential data. The original feature set is itself undefined; there are
potentially an infinite number of sequences of arbitrary length over $m$
categorical attributes or dimensions. Even if we restrict ourselves to some maximum
sequence length $k$, we have potentially $O(m^k)$ subsequences over $m$ dimensions.
The goal of feature selection in sequential domains is to select the best subset of
sequential features out of the possible sequential features (i.e., subsequences).

We now briefly describe FeatureMine [7], a scalable algorithm based on
SPADE, that mines features to be used for sequence classification. The input
database $\mathcal{D}$ consists of a set of input-sequences, each with a class label. Let
$\beta$ be a sequence and $c$ be a class label. The confidence of the rule $\beta \Rightarrow c$ is
given as $\sigma(\beta, \mathcal{D}_c)/\sigma(\beta, \mathcal{D})$, where $\mathcal{D}_c$ is the subset of input-sequences in
$\mathcal{D}$ with class label $c$. Our goal is to find all frequent sequences with high
confidence. Figure 17a shows a database of customers with labels. There are 7
input-sequences, 4 belonging to class $c_1$, and 3 belonging to class $c_2$. In general
there can be more than two classes. We use a separate minsup for each class. For
example, while $C$ is frequent for class $c_2$, it is not frequent for class $c_1$. The rule
$C \Rightarrow c_2$ has confidence 3/4 = 0.75, while the rule $C \Rightarrow c_1$ has confidence 1/4 = 0.25.
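
The rule confidence can be illustrated with a short Python sketch. The labelled
database below is hypothetical (it is not the database of Figure 17a); it is
merely constructed so that the item C occurs in one of four sequences of class
c1 and in all three sequences of class c2. The helper names are likewise
assumptions.

```python
def seq(*events):
    """Build a sequence: a tuple of events, each a frozenset of items."""
    return tuple(frozenset(e) for e in events)


def contains(sequence, pattern):
    """True if pattern is a subsequence of sequence (events matched in order)."""
    pos = 0
    for event in sequence:
        if pos < len(pattern) and pattern[pos] <= event:
            pos += 1
    return pos == len(pattern)


def support(pattern, sequences):
    """Number of sequences that contain the pattern."""
    return sum(1 for s in sequences if contains(s, pattern))


def rule_confidence(pattern, label, database):
    """Support within the class divided by support in the whole database."""
    total = support(pattern, [s for s, _ in database])
    in_class = support(pattern, [s for s, lab in database if lab == label])
    return in_class / total if total else 0.0


# Hypothetical labelled database: 4 sequences in c1, 3 in c2.
database = [
    (seq("AB", "B"), "c1"),
    (seq("A", "AB"), "c1"),
    (seq("B", "AB"), "c1"),
    (seq("AB", "C"), "c1"),
    (seq("AB", "BC"), "c2"),
    (seq("AC"), "c2"),
    (seq("C", "B"), "c2"),
]
conf_c2 = rule_confidence(seq("C"), "c2", database)
conf_c1 = rule_confidence(seq("C"), "c1", database)
```

With these made-up sequences the two confidences are 0.75 and 0.25, matching
the values quoted in the text.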




Fig. 17. a) Database with Class Labels, b) New Database with Boolean Features 



We now describe how frequent sequences $\beta_1, \ldots, \beta_n$ can be used as features
for classification. Recall that the input to most standard classifiers is an example
represented as a vector of feature-value pairs. We represent an example sequence
$\alpha$ as a vector of feature-value pairs by treating each sequence $\beta_i$ as a boolean
feature that is true iff $\beta_i \preceq \alpha$. For example, suppose the features are
$f_1 = A \rightarrow D$, $f_2 = A \rightarrow BC$, and $f_3 = CD$. The input sequence $AB \rightarrow BD \rightarrow BC$
would be represented as $(f_1, 1), (f_2, 1), (f_3, 0)$. Figure 17b shows the new dataset
created from the frequent sequences of our example database of Figure 17a.
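
A minimal Python sketch of this encoding follows. It reads an arrow as
"followed by, in a later event" and adjacent letters as items of a single
event; the helper names are assumptions. Under the containment test, the input
sequence AB, then BD, then BC contains A followed by D and A followed by BC,
but no single event containing both C and D.

```python
def seq(*events):
    """Build a sequence: a tuple of events, each a frozenset of items."""
    return tuple(frozenset(e) for e in events)


def contains(sequence, pattern):
    """True if pattern is a subsequence of sequence (events matched in order)."""
    pos = 0
    for event in sequence:
        if pos < len(pattern) and pattern[pos] <= event:
            pos += 1
    return pos == len(pattern)


def to_feature_vector(sequence, features):
    """One boolean (0/1) value per mined feature."""
    return [int(contains(sequence, f)) for f in features]


features = [seq("A", "D"), seq("A", "BC"), seq("CD")]   # f1, f2, f3
example = seq("AB", "BD", "BC")
vector = to_feature_vector(example, features)
```

The resulting vector can be handed directly to any classifier that expects
fixed-length boolean inputs.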

FeatureMine uses the following heuristics to determine the “good”
features: 1) features should be frequent, 2) they should be distinctive of at least one
class, and 3) feature sets should not contain redundant features. FeatureMine
employs pruning functions, similar to the three outlined in the last section, to
achieve these objectives. Further, all pruning constraints are directly integrated
into the mining algorithm itself, instead of applying pruning as a post-processing
step. This allows FeatureMine to search very large spaces efficiently, which
would have been infeasible otherwise.

8.1 Experiments 

To evaluate the effectiveness of FeatureMine, we used the feature set it
produces as input to two standard classification algorithms: Winnow [8] and Naive
Bayes [3]. We ran experiments on three datasets described below. In each case, we
experimented with various settings for minsup, $max_w$ (maximum event size),
and $max_l$ (maximum number of events) to generate reasonable results.

Random Parity We first describe a non-sequential problem on which standard
classification algorithms perform very poorly. Each input example consists of N
parity problems of size M with L distracting, or irrelevant, features. Thus there
are a total of N × M + L boolean-valued features. Each instance is assigned one
of two class labels (ON or OFF) as follows. Out of the N parity problems (per
instance), if the weighted sum of those with even parity exceeds a threshold,
then the instance is assigned class label ON, otherwise it is assigned OFF. Note
that if M > 1, then no feature by itself is at all indicative of the class label ON
or OFF, which is why parity problems are so hard for most classifiers. The job
of FeatureMine is essentially to figure out which features should be grouped
together. We used a minsup of .02 to .05, $max_l = 1$ and $max_w = M$.
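
The labelling rule can be made concrete with a small Python sketch. The text
does not give the weights or the threshold, so the values used below, the
layout of the bit vector (parity groups first, distractors last), and the
function names are all assumptions.

```python
import random


def label_instance(bits, n, m, weights, threshold):
    """Label an instance from its first n*m bits; trailing bits are distractors."""
    score = 0.0
    for i in range(n):
        group = bits[i * m:(i + 1) * m]
        if sum(group) % 2 == 0:      # this parity problem has even parity
            score += weights[i]
    return "ON" if score > threshold else "OFF"


def random_instance(n, m, l, weights, threshold, rng):
    """Draw N*M + L random boolean features and attach the class label."""
    bits = [rng.randint(0, 1) for _ in range(n * m + l)]
    return bits, label_instance(bits, n, m, weights, threshold)


rng = random.Random(0)
bits, label = random_instance(n=5, m=3, l=5, weights=[1.0] * 5,
                              threshold=2.5, rng=rng)
```

Flipping any single bit inside a parity group changes that group's parity,
which is why no individual feature is informative on its own.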

Fire World We obtained this dataset from a simple forest-fire domain [7]. We use
a grid representation of the terrain. Each grid cell can contain vegetation, water,
or a base. We label each instance with SUCCESS if none of the locations with
bases have been burned in the final state, or FAILURE otherwise. Thus, our job is
to predict if the bulldozers will prevent the bases from burning, given a partial
execution trace of the plan. For this data, there were 38 items to describe each
input-sequence. In the experiments reported below, we used minsup = 20%,
$max_w = 3$, and $max_l = 3$, to make the problem tractable.

Spelling To create this dataset, we chose pairs of commonly confused words, such
as “there” and “their”, “I” and “me”, “than” and “then”, and “your” and “you’re”,
and searched for sentences in the 1-million-word Brown corpus containing either
word [7]. We removed the target word and then represented each word by the
word itself, the part-of-speech tag in the Brown corpus, and the position relative
to the target word. For the “there” vs. “their” dataset there were 2917 training
examples, 755 test examples, and 5663 feature/value pairs or items. Other datasets
had similar parameters. In the experiments reported below, we used minsup
= 5%, $max_w = 3$, and $max_l = 2$.

Table 2. Classification Results (FM denotes features produced by FeatureMine)

Experiment                      Winnow     WinnowFM   Bayes      BayesFM
parity, N = 5, M = 3, L = 5     .51 (.02)  .97 (.03)  .50 (.01)  .97 (.04)
parity, N = 3, M = 4, L = 8     .49 (.01)  .99 (.04)  .50 (.01)  1.0 (0)
parity, N = 10, M = 4, L = 10   .50 (.01)  .89 (.03)  .50 (.01)  .85 (.06)
fire, time = 5                  .60 (.11)  .79 (.02)  .69 (.02)  .81 (.02)
fire, time = 10                 .60 (.14)  .85 (.02)  .68 (.01)  .75 (.02)
fire, time = 15                 .55 (.16)  .89 (.04)  .68 (.01)  .72 (.02)
spelling, their vs. there       .70        .94        .75        .78
spelling, I vs. me              .86        .94        .66        .90
spelling, than vs. then         .83        .92        .79        .81
spelling, you’re vs. your       .77        .86        .77        .86

For each test in the parity and fire domains, we generated 7,000 random
training examples. We mined features from 1,000 examples, pruned features that
did not pass a chi-squared significance test (for correlation to a class the feature
was frequent in) in 2,000 examples, and trained the classifier on the remaining
5,000 examples. We then tested on 1,000 additional examples. The results in
Table 2 are averages from 25-50 such tests. For the spelling correction, we used
all the examples in the Brown corpus, roughly 1000-4000 examples per word set,
split 80-20 (by sentence) into training and test sets. We mined features from 500
sentences and trained the classifier on the entire training set.
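
The chi-squared filter mentioned above can be sketched as follows. The text
does not state the significance level or the exact form of the test, so the
2×2 contingency formulation, the 5% level (critical value 3.841 for one degree
of freedom), and the function names are assumptions of this illustration.

```python
def chi_squared_2x2(a, b, c, d):
    """Chi-squared statistic for a 2x2 table.

    a: feature present, in class      b: feature present, not in class
    c: feature absent, in class       d: feature absent, not in class
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator


CRITICAL_VALUE = 3.841  # chi-squared, 1 degree of freedom, 5% level (assumed)


def is_significant(feature_values, labels, target_class):
    """True if the feature is significantly correlated with the target class."""
    a = b = c = d = 0
    for present, label in zip(feature_values, labels):
        in_class = (label == target_class)
        if present and in_class:
            a += 1
        elif present:
            b += 1
        elif in_class:
            c += 1
        else:
            d += 1
    return chi_squared_2x2(a, b, c, d) > CRITICAL_VALUE


# A feature that perfectly tracks the class, and one that is independent of it.
perfect = is_significant([1] * 20 + [0] * 20, ["ON"] * 20 + ["OFF"] * 20, "ON")
useless = is_significant([1, 0] * 20, ["ON", "ON", "OFF", "OFF"] * 10, "ON")
```

Features for which the test fails on the held-out pruning examples would be
discarded before the classifier is trained.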

Table 2, which shows the average classification accuracy using different fea- 
ture sets, confirms that the features produced by FeatureMine improved clas- 
sification performance. We compared using the feature set produced by Fea- 
tureMine with using only the primitive features themselves, i.e. features of 
length 1. The standard deviations are shown, in parentheses following each aver- 
age, except for the spelling problems for which only one test and training set 
were used. Both Winnow and Naive Bayes performed much better with the fea- 
tures produced by FeatureMine. In the parity experiments, the mined features 
dramatically improved the performance of the classifiers and in the other experi- 
ments the mined features improved the accuracy of the classifiers by a significant 
amount, often more than 20%. 

9 Conclusions 

In this chapter we presented SPADE, a new algorithm for fast mining of se- 
quential patterns in large databases. Unlike previous approaches which make 
multiple database scans and use complex hash-tree structures that tend to have 
sub-optimal locality, SPADE decomposes the original problem into smaller sub- 
problems using equivalence classes on frequent sequences. Not only can each 
equivalence class be solved independently, but it is also very likely that it can 
be processed in main-memory. Thus SPADE usually makes only three database 
scans - one for frequent 1-sequences, another for frequent 2-sequences, and one 
more for generating all other frequent sequences. If the supports of 2-sequences
are available then only one scan is required. SPADE uses only simple temporal
join operations, and is thus ideally suited for direct integration with a DBMS.
An extensive set of experiments was conducted to show that SPADE
outperforms the best previous algorithm, GSP, by a factor of two, and by an order of
magnitude with precomputed support of 2-sequences. Further, it scales linearly
in the number of input-sequences and other dataset parameters.

We discussed how the mined sequences can be used in a planning applica- 
tion. A simple mining of frequent sequences produces a large number of patterns, 
many of them trivial or useless. We proposed novel pruning strategies applied 
in a post-processing step to weed out the irrelevant patterns and to locate the 
most interesting sequences. We used these predictive sequences to improve pro- 
babilistic plans and for raising alarms before failures happen. 

Finally, we showed how sequence mining can help select good features for 
sequence classification. These domains are challenging because of the exponential 
number of potential subsequence features that can be formed from the primitives 
for describing each item in the sequence data. The number of features, containing
many irrelevant and redundant features, is too large to be practically handled by
today’s classification algorithms. Our experiments using several datasets show
that the features produced by mining predictive sequences significantly improve
classification accuracy.

1. Introduction

A hallmark of Cognitive Science is its interdisciplinary approach to the study of the struc- 
tures and mechanisms of human and artificial cognitive systems. Over the past decade, sequential 
pattern acquisition has attracted the attention of researchers from Computer Science, Cognitive 
Psychology and the Neurosciences. Methodologies for the investigation of sequence learning 
processes range from the exploration of computational mechanisms to the conduct of experimen- 
tal studies and the use of event-related brain potentials, functional magnetic resonance imaging 
and positron emission tomography (Clegg, DiGirolamo, & Keele 1998; Curran 1998). This 
interest in sequential pattern acquisition is also understandable from an evolutionary perspective: 
sequencing information and learning of event contingencies are fundamentally important proc- 
esses without which adaptive behavior in dynamic environments would hardly be possible. From 
sequencing continuous speech to learning operating sequences of technical devices or acquiring
the skill to play a musical instrument, learning of event sequences seems to be an essential
capability of human (and artificial) cognition. It is therefore not surprising that we are witnessing a
tremendous interest in sequence learning in the disciplines contributing to Cognitive Science.
This chapter presents and analyses a model of sequence learning based on the Act-R theory
(Anderson 1993; Anderson & Lebiere 1998). The model is rooted in, and has been successfully
applied to, empirical studies using the serial reaction task that was introduced by Nissen and
Bullemer (1987).

A large variety of empirical studies using the serial reaction time paradigm have explored the 
conditions under which subjects display sensitivity to the sequential structure of events. In a 
typical sequence learning experiment subjects are exposed to visuospatial sequences in a speeded 
compatible response mapping serial reaction time task. Participants are asked to react as quickly 
and accurately as possible with a discriminative response to the location of a target stimulus that 
may appear at one of a limited number of positions. Usually these target stimuli consist of sig- 
nals that are presented in one of several horizontally aligned positions on a computer screen. 
Responses most frequently require participants to press keys that spatially match these positions. 
Unbeknownst to the subjects, the presented sequence of visual signals follows a well-defined 
systematicity resulting in a continuous stream of structured events to which subjects have to 
respond. Depending on the respective experiment, the systematicity of signals follows one of 
three types of patterns: it executes a deterministically repeating sequence of fixed length (Simple
Repeating Paradigm; e.g. Willingham, Nissen, & Bullemer 1989), it is constructed on the basis
of the output of a probabilistic and noisy finite-state grammar (Probabilistic Sequence Paradigm;
e.g. Jimenez, Mendez, & Cleeremans 1996), or it is generated by a complex set of rules
(Deterministic Rule-based Paradigm; e.g. Stadler 1989) that allows the construction of several
alternative deterministic signal successions (Cleeremans & Jimenez 1998, p. 325). Sequence learning
and thus sensitivity to the sequential constraints is said to have been expressed when (1) subjects
exposed to systematic event sequences show faster response latencies than those responding to
random event sequences, or (2) response times of subjects increase significantly when systematic
sequences are temporarily switched to random sequences. The faster response times are interpreted
as resulting from acquired knowledge about the event pattern that allows subjects to prepare their
responses. Figure 1 illustrates a prototypical data pattern to be found when using the serial
reaction task (Nissen & Bullemer 1987). As the graph shows, response latencies from subjects
exposed to structured sequences in blocks 1-5 increase distinctly when the event succession is
temporarily switched to a random presentation in blocks 6 and 7 and decrease again when switched
back to the systematic sequence in blocks 8 and 9. The amount of sequence learning can be
assessed indirectly by contrasting response latencies to structured sequences with reaction times
to randomly presented events.

Figure 1. A prototypical data pattern in a study of sequence learning.



A broad range of experimental studies in sequence learning report having demonstrated
functional dissociations between performance in sequence learning (as revealed indirectly by
contrasting latencies in systematic vs. random blocks) and conscious awareness of a sequence (as
assessed by direct post-task measures). These dissociations have led researchers to speculate that
the serial reaction task and post-task tests might tap two independent knowledge bases: implicit
learning that can only be expressed in immediate task performance, and explicit learning* that can
be verbalized or used to predict event sequences. Investigating the role of consciousness in
learning has become a major focus in experimental studies of sequential pattern acquisition and the
field has flourished along with research on artificial grammar learning (Reber 1993) and complex
system control (Berry & Broadbent 1984) into one of the major paradigms for investigating
implicit learning processes.

* Frensch (1998) identifies no less than eleven different definitions of implicit learning in the field,
“a diversity that is undoubtedly symptomatic of the conceptual and methodological difficulties
facing the field” (Cleeremans, Destrebecqz, & Boyer, 1998, p. 407). Berry and Dienes (1993)
provide a largely theory-neutral definition by referring to learning as implicit when subjects acquire
information without intending to do so, and in such a way that the resulting knowledge is difficult
to express using direct post-task tests.

In contrast to the extraordinarily wide body of experimental data available on serial pattern
acquisition, theory development to explain the empirically observed phenomena has been
surprisingly scarce. Much of the experimental work has concentrated on investigating the conditions
under which phenomena like unconscious learning are likely to occur or when attentional resources
are required for learning, while the development of detailed computational models of empirical
data has been largely neglected. An important exception is Cleeremans’s model based on simple
recurrent networks (1993; Cleeremans & McClelland 1991; Jimenez et al. 1996) which has been
applied to a wide set of experimental data, including the classical study of Lewicki et al. (1988)
and its conceptual replications by Perruchet, Gallego, & Savy (1990) and Ferrer-Gil (1994). A
simple recurrent network (Srn; Elman 1990) uses a fully-connected, three-layer, back-propagation
architecture that allows the network to predict the next element of a sequence based upon the
current stimulus and the network’s sensitivity to the temporal context in which consecutive
stimuli occur. Sensitivity to the temporal context of a Srn is derived by utilizing fixed
one-to-one recurrent connections between hidden units and context units that encode the activation
vector of a previous time step’s hidden units. Over time, these recurrent connections then enable
contextual information to differentiate different parts of a sequence.

A Srn seems to provide a particularly natural account for implicit learning phenomena since all
the knowledge of the network is stored in its connections and may thus only be expressed
through performance, a phenomenon which is a central component in definitions of implicit
learning (Berry & Dienes 1993). However, a current major limitation of Srns seems to be that
they “do not account for explicit learning or the difference between it and implicit learning but
could perhaps be expanded to do so” (Stadler & Roediger 1998, p. 126). Cleeremans and
Jimenez (1998, p. 350) note that modeling some effects would “probably require additional learning
mechanisms (e.g., mechanisms based on explicit memory for salient aspects of the material) that
are not all captured by the model”. It remains to be seen how such mechanisms can be integrated
into the architecture of a Srn.

Hybrid architectures provide a potentially fruitful way of integrating implicit and explicit
knowledge. The Clarion architecture (Sun 1999) can be described as a two-level architecture
where the top level uses a localist representation (with each representational unit representing a
distinct entity), while the bottom level is based on distributed representations (where an entity to
be represented corresponds to a pattern of activation among a set of representational units).
Conscious knowledge is encoded in the representational nodes of Clarion’s top level and is
assumed to be accessible and explicit. Implicit knowledge resides in the distributed
representations of Clarion’s bottom level and is inaccessible to conscious processing. Learning at the
top level is described as explicit one-shot hypothesis-testing, while the bottom level provides gradual
weight-tuning mechanisms. Based on the task, the outcomes of the two levels can be altered with
the top level generally involving controlled processing and the bottom level being governed by
automatic processing. The Clarion architecture has successfully been applied to modeling
human data from a number of implicit learning paradigms, including sequence learning (Sun 1999).
According to Clarion, implicit learning is triggered by a situation that is too complex to be
handled by the explicit processes at the top level and is thus processed by the bottom level which
can better cope with complex relations in the environment. As Sun (1999) shows, the interplay
of Clarion’s two levels provides a natural account for the relationship of explicit and implicit
knowledge in contributing to human performance.

2. An Act-R theory of sequence learning 

Recently, Lebiere and Wallach (1998; in prep.) have proposed a theory of sequence learning
based on the Act-R cognitive architecture (Anderson 1993; Anderson & Lebiere 1998). Act-R is
a hybrid production system that distinguishes between a permanent procedural memory and a
permanent declarative memory. Procedural knowledge is encoded in Act-R using modular
condition-action rules (productions) to represent potential actions to be taken when certain conditions
are met. Declarative structures called chunks are used to store factual knowledge in declarative
memory. Chunks encode knowledge as structured, schema-like configurations of labeled slots
that can be organized hierarchically. A representation of goals in a goal stack is utilized to control
information processing whereby exactly one chunk is designated to be the active goal of the
system. Knowledge represented symbolically by chunks and productions is associated with
subsymbolic (i.e. real-valued) numerical quantities that control which productions are used and which
chunks are retrieved from memory. These quantities reflect past statistics of use of the respective
symbolic knowledge structures and are learned by Bayesian learning mechanisms derived from a
rational analysis of cognition (Anderson 1990). Generally, subsymbolic learning allows Act-R
to adapt to the statistical structure of an environment.

A basic assumption of the Act-R theory of sequence learning is that the mappings between
stimulus locations and response keys in an experiment are encoded as chunks. Each chunk
associates the respective stimulus location on the screen to the desired response key. These declarative
representations essentially represent a straightforward explicit encoding of the experimental
instructions informing the subjects of the stimulus-response mappings. When a stimulus is
observed, the chunk representing the mapping between that stimulus location and the associated
response key will be retrieved. Each retrieval results in the immediate reinforcement of that chunk
through Act-R’s base-level learning mechanism that strengthens a chunk to reflect its frequency
of use. Formally, activation reflects the log posterior odds that a chunk is relevant in a particular
situation. The activation $A_i$ of a chunk i is computed as the sum of its base-level activation $B_i$
plus its context activation:

$$A_i = B_i + \sum_j W_j S_{ji} \qquad \text{(Activation Equation)}$$

In determining the context activation, $W_j$ designates the attentional weight given to the focus
element j. An element j is in the focus, or in context, if it is part of the current goal chunk (i.e.
the value of one of the goal chunk’s slots). $S_{ji}$ stands for the strength of association from element
j to a chunk i. Act-R assumes that there is a limited capacity of source activation and that each
goal element emits an equal amount of activation. Source activation capacity is typically assumed
to be 1, i.e. if there are n source elements in the current focus each receives a source activation
of 1/n (Anderson & Lebiere 1998). The associative strength $S_{ji}$ between an activation source
j and a chunk i is a measure of how often i was needed (i.e. retrieved in a production) when
chunk j was in the context. Associative strengths provide an estimate of the log likelihood ratio
measure of how much the presence of a cue j in a goal slot increases the probability that a particular
chunk i is needed for retrieval to instantiate a production.
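
As a minimal sketch of the Activation Equation (not the Act-R implementation itself), the following Python function sums a base-level term and the context term, sharing a total source activation of 1 equally among the goal elements. The function name, the element labels and all numerical values are hypothetical and chosen only for illustration.

```python
def activation(base_level, goal_elements, strengths):
    """Return A_i = B_i + sum_j W_j * S_ji, with W_j = 1/n for n goal elements.

    `strengths` maps each goal element j to its association strength S_ji
    with the chunk i under consideration (missing elements count as 0).
    """
    n = len(goal_elements)
    if n == 0:
        return base_level
    weight = 1.0 / n  # source activation capacity of 1, shared equally
    context = sum(weight * strengths.get(j, 0.0) for j in goal_elements)
    return base_level + context


# Hypothetical values: two goal elements, each receiving W_j = 0.5.
strengths_to_chunk = {"previous-stimulus": 1.2, "current-stimulus": 2.0}
a_i = activation(0.5, ["previous-stimulus", "current-stimulus"], strengths_to_chunk)
print(a_i)
```

With these numbers the context term is 0.5 * 1.2 + 0.5 * 2.0, so adding a third goal element with no association to the chunk would dilute the weights to 1/3 and lower the activation.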

The base level activation of a chunk is learned by an architectural mechanism to reflect the 
past history of use of a chunk i: 

$$B_i = \ln \sum_{j=1}^{n} t_j^{-d} \approx \ln \frac{n L^{-d}}{1-d} \qquad \text{(Base-Level Learning Equation)}$$

In the above formula $t_j$ stands for the time elapsed since the jth reference to chunk i, n is the
number of references, d is the memory decay rate and L denotes the lifetime of a chunk (i.e. the time
since its creation). As Anderson and Schooler (1991) have shown, this equation produces the Power Law
of Forgetting (Rubin & Wenzel 1996) as well as the Power Law of Learning (Newell & Rosenbloom 1981).
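
The relation between the exact sum and its closed-form approximation can be checked numerically. The sketch below assumes, purely for illustration, that the n references are spread evenly over the lifetime L of the chunk; the function names and the values of n and L are hypothetical.

```python
import math


def base_level_exact(lags, d=0.5):
    """B_i = ln(sum_j t_j^-d), where lags holds the time since each reference."""
    return math.log(sum(t ** -d for t in lags))


def base_level_approx(n, lifetime, d=0.5):
    """Approximation B_i ~ ln(n * L^-d / (1 - d)) for evenly spread references."""
    return math.log(n * lifetime ** -d / (1.0 - d))


# Hypothetical history: 100 references spread evenly over a lifetime of 1000 s.
n, lifetime = 100, 1000.0
lags = [lifetime * (k + 0.5) / n for k in range(n)]
exact = base_level_exact(lags)
approx = base_level_approx(n, lifetime)
print(round(exact, 3), round(approx, 3))
```

Under this even-spacing assumption the two values stay close to each other, which is what makes the approximate form convenient for the formal analyses later in the chapter.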

When retrieving a chunk to instantiate a production, ACT-R selects the chunk with the highest
activation $A_i$. However, some stochasticity is introduced in the system by adding gaussian noise
of mean 0 and standard deviation $\sigma$ to the activation of each chunk. In order to be retrieved,
the activation of a chunk needs to reach a fixed retrieval threshold $\tau$ that limits the accessibility
of declarative elements. If the gaussian noise is approximated with a sigmoid distribution, the
probability P of chunk i to be retrieved by a production is:

$$P = \frac{1}{1 + e^{-(A_i - \tau)/s}} \qquad \text{(Retrieval Probability Equation)}$$

where $s = \sqrt{3}\,\sigma/\pi$. The activation of a chunk i is directly related to the latency of its
retrieval by a production p. Formally, retrieval time $T_{ip}$ is an exponentially decreasing function of
the chunk’s activation $A_i$:

$$T_{ip} = F e^{-f A_i} \qquad \text{(Retrieval Time Equation)}$$

where F is a time scaling factor and f is the latency exponent. In addition to the latencies for chunk
retrieval as given by the Retrieval Time Equation, the total time of selecting and applying a production
is determined by executing the actions of a production’s action part, whereby a value of 50 ms is
typically assumed for elementary internal actions. External actions, such as pressing a key, usually have a
longer latency determined by the Act-R/Pm perceptual-motor module (Byrne & Anderson 1998).
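
The two retrieval equations can be sketched in a few lines of Python. This is only an illustration of the formulas as stated above; the function names, the threshold, the time scaling factor and the activation values are hypothetical.

```python
import math


def noise_scale(sigma):
    """Convert a Gaussian standard deviation into s = sqrt(3) * sigma / pi."""
    return math.sqrt(3.0) * sigma / math.pi


def retrieval_probability(activation, threshold, s):
    """Retrieval Probability Equation: P = 1 / (1 + exp(-(A - tau) / s))."""
    return 1.0 / (1.0 + math.exp(-(activation - threshold) / s))


def retrieval_time(activation, scale_factor=1.0, exponent=1.0):
    """Retrieval Time Equation: T = F * exp(-f * A)."""
    return scale_factor * math.exp(-exponent * activation)


# Hypothetical threshold and activations, chosen only for illustration.
threshold, s = 1.0, 0.25
for a in (0.5, 1.0, 1.5):
    print(a, round(retrieval_probability(a, threshold, s), 3), round(retrieval_time(a), 3))
```

A chunk whose activation equals the threshold is retrieved half of the time, and every additional unit of activation both raises the retrieval probability and shortens the retrieval latency.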

In summary, subsymbolic activation processes in ACT-R make a chunk active to the degree
that past experience and the present context (as given by the current goal) indicate that it is
useful at this particular moment. In the ACT-R sequence learning model, these reinforcements will
lead to higher activation levels for the chunks that map stimulus positions to the keys to be
pressed, which then results in faster response latencies. This speedup will occur independently of
whether the stimulus sequence is systematic or random because it only depends upon the
frequency of each retrieval.

The fundamental assumption of the ACT-R model of sequence learning is the persistence of
(working) memory. ACT-R states that the components of the current goal are sources of activation.
If the new goal is to respond to a particular stimulus with a certain response, we assume
that the previous stimulus (i.e. the central element of the previous goal) remains in the encoding
of the new goal. This assumption has two important implications. First, since every goal contains
both the previous stimulus and the next one, when that goal is popped and becomes a
chunk in declarative memory, it contains a record of a small fragment of the sequence. The set of
these chunks constitutes the model’s explicit knowledge of the sequence. The second implication
is that when the chunk encoding the mapping between the current stimulus and the proper response
key is retrieved, both the current stimulus and the previous one are components of the
goal and thus sources of activation. This co-occurrence between previous stimulus (as a source of
activation) and current stimulus (as a component of the mapping chunk being retrieved) is
automatically learned by Act-R in the association strengths between source stimulus and mapping
chunk and thus facilitates further processing. The subsymbolic strengths of associations between
consecutive sequence fragments constitute the model’s implicit knowledge of the sequence. An
additional assumption of the model controls the way chunks are built by determining the exact
nature of the previous context that persists in the current goal. If no correct anticipation or guess
was made as to the identity of the current stimulus, then that stimulus persists in the next goal,
leading to the creation of the next pair of the sequence as described above. However, if the current
stimulus could be correctly guessed by retrieving a chunk mapping the current context to
that stimulus, i.e. if the current piece of sequence is already known, then the current goal, which
is identical to the chunk retrieved, persists in the next goal as a more complex context, allowing
the next piece of sequence to be encoded using the current one. Therefore, while all memory
chunks only encode pairs, longer sequences can be represented because the first element of the
pair can be a chunk as well as a basic stimulus.

Table 1 describes the procedural knowledge that uses the declarative encodings, both implicit
and explicit, to perform the task. The basic goal, as expressed in the instructions to the sequence
learning experiments, is to map the location of a screen stimulus to a key and press that key as a
response. Production Input checks that no stimulus has been encoded yet and checks if one is
present and if so encodes it and places its location in the current goal. Production Map-Location
then retrieves the chunk from declarative memory that maps this location to the proper response
key, and places the key in the goal. Production Type-Key then types the respective key and pops
the goal. Before a stimulus has appeared, production Guess attempts to retrieve a chunk that
holds a piece of the sequence starting with the current context, and if successful uses that chunk
to anticipate which stimulus will appear and which key to press. Once the stimulus appears, if
the anticipation was correct then the retrieved key can be typed directly without the need for
mapping location to key. Otherwise, the production Bad-Guess withdraws the prepared response,
which requires mapping the actual stimulus location to a different key. When a response has
been given, if no correct guess was made then the production Next-Pair creates and focuses on a
new goal with the current stimulus as the context, which will lead to the encoding of the next
pair of stimuli. If a correct guess was made, however, then the production Higher-Chunk focuses
on a new goal with the current goal as the next context, leading to the creation of a new chunk
encoding the current piece of sequence together with the next stimulus.



Input
    IF   the goal is to respond to a stimulus
         and no stimulus has been encoded yet
    THEN check if a stimulus is present and encode it

Map-Location
    IF   the goal is to respond to a stimulus
         and no key has been prepared to respond to stimulus
         and key is associated to stimulus
    THEN get ready to respond with key

Type-Key
    IF   the goal is to respond to a stimulus
         and a response with key has been prepared
    THEN type key

Guess
    IF   the goal is to respond to a stimulus given a previous context
         and there is no stimulus present and no guess has been made
         and context was in the past most often followed by stimulus responded to by key
    THEN guess that the next stimulus will be stimulus and prepare to respond with key

Bad-Guess
    IF   the goal is to respond to a stimulus
         and a guess was made that is different from the encoded stimulus
    THEN withdraw the prepared response key

Next-Pair
    IF   the goal is to respond to a stimulus
         and a response was made without a correct guess
    THEN focus on a new goal with the stimulus as context

Higher-Chunk
    IF   the goal is to respond to a stimulus
         and a response was made with correct guess
    THEN focus on a new goal with this goal as context



Table 1: Productions to perform the mapping of a stimulus to a response key 
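
The way the Guess, Next-Pair and Higher-Chunk productions build up hierarchical chunks can be illustrated with a purely symbolic sketch. It is a deliberate simplification and not the Act-R model: activations, noise, decay and the retrieval threshold are all left out, and a guess simply retrieves the stimulus that most recently followed the current context. The function name `build_chunks` and the example sequence are hypothetical.

```python
def build_chunks(stimuli):
    """Return the distinct (context, stimulus) chunks created over a run of trials."""
    memory = {}     # context -> stimulus that most recently followed it
    chunks = []     # distinct (context, stimulus) pairs, in order of creation
    context = None  # there is no previous stimulus before the first trial
    for stimulus in stimuli:
        guess = memory.get(context)        # Guess: anticipate from the context
        guessed_correctly = guess == stimulus
        if context is not None:
            chunk = (context, stimulus)    # the popped goal becomes a chunk
            if chunk not in chunks:
                chunks.append(chunk)
            memory[context] = stimulus
        if guessed_correctly:
            context = (context, stimulus)  # Higher-Chunk: the goal is the next context
        else:
            context = stimulus             # Next-Pair: the stimulus is the next context
    return chunks


# Three repetitions of a hypothetical unique sequence a b c.
chunks = build_chunks(list("abcabcabc"))
for chunk in chunks:
    print(chunk)
```

Every chunk is a pair, but once a pair such as (a, b) has been guessed correctly it becomes the context of the next goal, so that a chunk such as ((a, b), c) covers three consecutive stimuli.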



The Act-R theory of sequence learning precisely specifies mechanisms and memory structures
on a level that is sufficient to implement a computational model which generates qualitative and
quantitative predictions that can be compared to empirical data. To explore the scope of the
theory, we have modeled three seminal studies in the field (Willingham, Nissen, & Bullemer 1989;
Perruchet & Amorim 1992; Curran & Keele 1993). The experimental conditions in these studies
vary widely with regard to the type of sequence used, the length of the sequence, the distribution
of systematic vs. random blocks, the number and length of blocks, the Response-to-Stimulus
Interval used and whether the serial reaction task was presented as a single task or in combination
with a secondary task (“tone counting”). Our validation of the ACT-R model was not restricted to
comparing model-generated and empirical data on a single dimension, but comprised a
comparison of latencies, learning trajectories, errors, stimulus anticipations, individual differences as well
as the structure of acquired chunks. In our view, the comparison of model predictions and
empirical data presented in Lebiere and Wallach (1998; in prep.) provides clear evidence for the
broad empirical scope and integrative character of the Act-R approach to sequence learning,
overcoming the lack of theoretical development in the field that was criticized in the introduction
to this chapter. While the reader is referred to Lebiere and Wallach (1998; in prep.) for an
extensive discussion of the modeling results, the next section presents a fine-grained sensitivity
analysis of the Act-R model.

3. Empirical Analysis 

The goal of this section is to report the results of a detailed sensitivity analysis of the model’s
performance as a function of a number of its parameters and of the experimental conditions to which it
is applied. Conducting such a sensitivity analysis opens up the possibility of investigating the
behavioral and computational effects of certain internal and external model manipulations and thus
allows for a precise analysis of the implications of a model’s underlying theoretical assumptions.
Although VanLehn, Brown and Greeno (1984) forcefully argued in favor of detailed “model
experiments” more than fifteen years ago, sensitivity analyses are still the exception in cognitive
modeling: “Cognitive science has given psychology a new way of expressing models of cognition
that is much more detailed and precise than its predecessors. But unfortunately, the increased
detail and precision in stating models has not been accompanied by correspondingly detailed and
precise arguments analyzing and supporting them” (VanLehn, Brown, & Greeno 1984, p. 237).
This section presents a fine-grained analysis of variations of central model and experiment parameters
including:

- sequence type, i.e. whether the stimuli composing the sequence appear only once (unique
sequence), multiple times (ambiguous sequence) or both (mixed or hybrid sequence).

- rate of memory decay, which controls how fast a sequence can be learned (and forgotten).

- noise in chunk activation, which controls the stochasticity of the model.

- threshold $\tau$ on memory retrieval, which controls how much activation is needed for memory
retrieval. It is expressed here in terms of, and varies inversely with, the Response-to-Stimulus
Interval used in experimental studies.

- change in the sequence, i.e. how fast an old sequence can be unlearned and a new one learned. 

- choice of knowledge representation, i.e. whether the sequence is encoded in hierarchical chunks 
or in fixed-length chunks, and in that case what the impact of the chunk length is. 

The model analysis will be primarily empirical, with results being reported for the average of 1000
Monte Carlo runs of the model over 20 blocks of 12 stimuli, unless otherwise specified. Whenever
tractable, the empirical results will be supplemented by a formal analysis of the model using the
Act-R equations introduced with the model. Due to the complexity of the model and the approximations
inherent in some of the equations themselves, the results derived by the formal analysis are only
approximate but are nonetheless useful in understanding the empirical results and the parameters that
control them. The results and analyses are reported for three dependent measures of the performance of
the model:

1. Response Time (RT) is the most widely used behavioral measure in cognitive psychology. 
In the Act-R model, response time is a measure both of how likely each element of the 
sequence is to be correctly retrieved from memory as well as of the amount of practice and 
context of the chunk holding that element. 

2. Odds of Guessing measure the likelihood of anticipating, correctly or incorrectly, each
stimulus by attempting to retrieve the next element of the sequence from memory. It is
correlated with but does not directly correspond to response time because of the possibility of
incorrectly guessing the next stimulus and the context effects involved in mapping that
stimulus.

3. Odds of Error is another standard behavioral measure in cognitive psychology. In this 
model, it measures the likelihood of incorrectly anticipating the next stimulus through 
memory retrieval and is at most equal to the odds of guessing. It reflects the general con- 
fusion of the model about the identity of the sequence. 

While those quantitative measures are related, they express different characteristics of the model’s 
behavior. The trade-offs involved in trying to optimize these measures are a fundamental part of 
human cognition. Studying how these measures vary as a function of the parameters described 
previously should yield important insights into the model’s predictions about the powers and limits 
of human cognition. 

3.1 Sequence Type 

Intuitively, some sequences should be harder to learn than others. The difficulty in learning a 
sequence depends upon the bias of the model, but a good general measure of difficulty for se- 
quences of discrete elements is the ambiguity in the succession of stimuli. In our empirical 
analysis of the model we will use three types of sequence, listed in increasing order of difficulty: 

1. Unique sequences in which each stimulus only appears once in each repetition of the se- 
quence, e.g. a b c d, if a, b, c and d denote the possible spatial positions of a stimulus. 
In unique sequences the next stimulus can always be unambiguously predicted from the 
previous one. 

2. Mixed or hybrid sequences in which some stimuli are always followed by the same stimuli
but some are not, e.g. a b c b d c, where a and d are unique stimuli but b and c are not.

3. Ambiguous sequences in which all stimuli appear multiple times in the sequence and no
stimulus can be unambiguously predicted from the previous one, e.g. a b c b a d c a c d
b d. That sequence is a particular type of ambiguous sequence called a de Bruijn sequence
(de Bruijn 1946), because each stimulus in the sequence is followed once and only once by
all other stimuli (e.g. a is followed once by b, c and d).²
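
The three sequence types can be told apart mechanically from the successors of each stimulus. The following sketch treats a sequence as cyclic, since it is repeated throughout training; the function names are hypothetical, and the example sequences are the ones given in the list above.

```python
from collections import defaultdict


def successors(sequence):
    """Map each stimulus to the set of stimuli that follow it in the cyclic sequence."""
    follow = defaultdict(set)
    for i, stimulus in enumerate(sequence):
        follow[stimulus].add(sequence[(i + 1) % len(sequence)])
    return follow


def sequence_type(sequence):
    """Classify a sequence as 'unique', 'mixed' or 'ambiguous'."""
    follow = successors(sequence)
    ambiguous = [s for s, nxt in follow.items() if len(nxt) > 1]
    if not ambiguous:
        return "unique"
    if len(ambiguous) == len(follow):
        return "ambiguous"
    return "mixed"


def is_non_repeating_de_bruijn(sequence):
    """True if every stimulus is followed exactly once by every other stimulus."""
    symbols = set(sequence)
    pairs = [(s, sequence[(i + 1) % len(sequence)]) for i, s in enumerate(sequence)]
    expected = {(a, b) for a in symbols for b in symbols if a != b}
    return len(pairs) == len(expected) and set(pairs) == expected


for seq in ("abcd", "abcbdc", "abcbadcacdbd"):
    print(seq, sequence_type(seq), is_non_repeating_de_bruijn(seq))
```

For the twelve-element example, each of the four stimuli has three distinct successors and all twelve ordered pairs of different stimuli occur exactly once, which is the de Bruijn property described in the text.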

Figure 2 reports the results of applying the model to each type of sequence in terms of reaction
times (RT) and odds of guessing, averaged per training block of 12 stimuli over 1000 runs.
Since there are four distinct stimulus positions, the unique sequence is repeated three times per
block and the mixed sequence twice per block while the ambiguous sequence appears once per
block. Reaction times and odds are plotted on a log-log scale as a function of training block. In
human subjects unique sequences are the easiest to learn and can be learned unconsciously over
time (Cohen, Ivry, & Keele 1990) while ambiguous sequences are the hardest to learn and are
thought to require attentional effort to be acquired (Curran & Keele 1993; but see Stadler 1995
and Frensch, Lin, & Buchner 1998 for diverging evidence), with mixed sequences being a middle
case. Not surprisingly, this is also the case for our model, both in terms of reaction times
and odds of guessing, with learning in unique sequences being consistently faster than in
ambiguous sequences. Learning in mixed sequences is between the two but closer to unique
sequences than to ambiguous sequences. The reason for this latter effect is that mixed sequences
are built in memory starting with their unique stimuli, because unlike for ambiguous stimuli the
next stimulus position can be reliably predicted from the current stimulus alone, i.e. from
retrieving the basic chunks encoding pairs of stimuli. Thus learning in mixed sequences will be
determined more by its unique stimuli than by its ambiguous stimuli.

² In typical de Bruijn sequences each stimulus is also followed by itself once, but we adhered here to
a convention of not repeating consecutive stimuli. Any such repeating sequences can be
straightforwardly converted into a non-repeating sequence by simply deleting one of the repeating stimuli.





Figure 2. Reaction times and odds of guessing for the three types of sequences. [Both panels: x-axis Training Block.]



Performance, both in terms of reaction times and odds of guessing, improves roughly linearly 
on the log-log scale, meaning that it follows the power law of practice (Newell & Rosenbloom 
1981) ubiquitous in human performance. This result is a direct consequence of how activation 
grows with practice and how performance is determined by activation. Assuming without loss of 
generality that activation reduces to its base-level component, the approximate form of Act-R’s 
Base-Level Learning Equation can be combined with the Retrieval Time Equation to yield a 
power function of response time as a function of the amount of practice n: 

$$RT = F e^{-f B_i} = F' n^{-f(1-d)} \qquad \text{(RT Practice Equation)}$$

Note that given the default values used for the latency exponent parameter f and the base-level
decay rate d of 1.0 and 0.5 respectively, this yields the power law slope of about 0.5 observed on
the left of Figure 2. Similarly, the Base-Level Learning Equation can be combined with the
Retrieval Probability Equation to yield a power function of the odds of retrieval (i.e. guessing the
position of the next stimulus) as a function of the amount of practice:



$$\text{Odds} = e^{(B_i - \tau)/s} \propto n^{(1-d)/s} \qquad \text{(Odds Practice Equation)}$$

While performance improvement is generally linear on the log-log scale, the curves are actually
composed of two separate quasi-linear segments of different slopes. This is another
consequence of the way the sequence is learned, i.e. by starting with a pair of stimuli and
iteratively adding chunks on that root to cover the whole sequence. In the first phase of the learning,
all pairs can be learned in parallel because retrieval is infrequent and the context usually defaults
to the previous stimulus. However, when the pairs of stimuli can be retrieved and the model
attempts to build longer sequences, learning becomes increasingly sequential with ultimately a
single set of chunks spanning the whole sequence. Thus learning still follows a power law but
becomes much slower as indicated by the much-reduced slope of the improvement. Note that the
transition from the first to the second phase happens faster for the unique sequence than for the
mixed and for the ambiguous sequences because each training block contains three repetitions of
the unique sequence and two of the mixed sequence for only one of the ambiguous sequence.
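
The power-law form of the RT Practice Equation can be checked numerically without relying on the approximate form of the Base-Level Learning Equation. The sketch below assumes, for illustration only, that a chunk is referenced at equally spaced moments and that activation reduces to its base-level component; it then measures the slope of response time against practice on a log-log scale. The function names, the spacing and the time scaling factor are hypothetical.

```python
import math


def response_time(n, d=0.5, f=1.0, scale_factor=1.0, spacing=1.0):
    """RT after n equally spaced references: F * exp(-f * B), B = ln(sum_k (k*spacing)^-d)."""
    base_level = math.log(sum((k * spacing) ** -d for k in range(1, n + 1)))
    return scale_factor * math.exp(-f * base_level)


def loglog_slope(n1, n2, **params):
    """Slope of log RT against log n between two amounts of practice."""
    rise = math.log(response_time(n2, **params)) - math.log(response_time(n1, **params))
    return rise / (math.log(n2) - math.log(n1))


slope_default = loglog_slope(100, 1000)            # d = 0.5, f = 1.0
slope_low_decay = loglog_slope(100, 1000, d=0.25)  # d = 0.25, f = 1.0
print(round(slope_default, 2), round(slope_low_decay, 2))
```

Under these idealized assumptions the measured slopes come out close to -f(1-d), i.e. near -0.5 for the default parameters, in line with the slope reported for the left panel of Figure 2.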




Figure 3. Odds of error for mixed and ambiguous sequences. [x-axis: Training Block.]



Finally, one can examine the performance of the model in terms of the odds of error of the
process of guessing the next stimulus as plotted in Figure 3. Because each stimulus is
unambiguously predicted by its predecessor in unique sequences, no errors occurred in that case and the
odds of error are only plotted for mixed and ambiguous sequences.³ Odds of error initially
increase in the first learning phase as chunks encoding pairs of stimuli become more active and the
odds of guessing increase,⁴ then decrease as the model enters the second learning phase and
retrievals are based on unambiguous contexts rather than ambiguous stimuli. Odds of error are
initially higher for mixed sequences than for ambiguous sequences because the odds of guessing
are much higher, but odds of error at and after the maximum are about an order of magnitude
lower for mixed sequences than for ambiguous sequences. In conclusion, by any of the three
measures used to assess performance, the model exhibits the same qualitative sensitivity as
human subjects to sequence difficulty reflecting stimulus ambiguity (Cohen et al. 1990).

³ A mechanism called partial matching that adjusts chunk activations according to the degree of
match could produce errors for unique sequences as well but was not activated in the model for
reasons of simplicity.

⁴ Interestingly, for the first half-dozen training blocks odds of error for ambiguous sequences are
almost equal to odds of guessing, meaning that almost all guesses are incorrect. This is an
interesting property of the interaction between de Bruijn sequences and base-level decay, in which the
most active chunks are those encoding the most recent appearances of the stimulus, which predict
stimuli other than the one about to occur.

3.2 Memory Decay Rate

Another characteristic of human sequence learning, indeed of any type of memory, is decay, i.e.
the gradual forgetting of information (Rubin & Wenzel 1996). Forgetting is often described as
an unnecessary shortcoming of human cognition. In a subsequent section, we will see that it is
in fact an essential attribute in dealing with a changing world. But even when considering a
single sequence as in this section, it can have desirable effects. Figure 4 plots the two measures
of learning speed, reaction times and odds of guessing, as a function of practice for memory
decay rates d of 0.25, 0.5 (the default) and 0.75.⁵ As expected, a higher decay rate slows down
learning both in terms of reaction time and odds of retrieval while a lower decay rate speeds it up.
Since the analysis in the previous section showed that performance is a power function of practice
whose exponent is proportional to 1-d, the difference between the low and the medium decay rate
(as measured by the ratio (1-0.25)/(1-0.5) = 1.5) is smaller than the difference between the medium
and the high decay rate ((1-0.5)/(1-0.75) = 2.0). More generally, differences in performance tend to
get amplified at the low end, as was the case in the previous section. The graph for odds of guessing
confirms the analysis of the Odds Practice Equation with the slope of the curve reflecting the 1-d
exponent, while the impact on reaction time primarily reflects the influence of d on the constant
factor F’ rather than the slope.

Figure 4. Reaction times and odds of guessing for three memory decay rates.

⁵ The ambiguous sequence described in the previous section was used because it generated higher
error rates but similar results could be obtained with the mixed (or any non-unique) sequence as
well.

While lower decay rates uniformly increase learning speed, they have a less beneficial effect
on the error rate. Figure 5 plots the odds of error as a function of practice and as a function of the
odds of guessing for 40 training blocks. Odds of error are initially higher for lower decay rates,
then reverse that trend after reaching their maximum. Part of the reason for the initially higher
error rates for low decay rates is that their odds of guessing are higher as well, as Figure 4
established. But that is not the whole story, as the second part of Figure 5 demonstrates. The odds
of error are initially the same for equivalent odds of guessing for all decay rates, but lower decay
rates, which produce faster learning, also lead to significantly higher maximum error rates before
decreasing in the second learning phase as longer, less ambiguous segments of the sequence are
constructed. Conversely, the high decay rate that produced slow learning also helps in limiting
the odds of error to a maximum that is less than half of that of the lowest decay rate.





Figure 5. Odds of error as a function of practice and of odds of guessing for 3 memory decay rates. [Second panel: x-axis Odds of Guess.]



In conclusion, forgetting in the form of memory decay, while making for slower learning, also
helps in limiting the number of errors. This suggests a tradeoff in which a decay rate is used that
provides for reasonably fast learning while keeping errors down. Such a rate might well be the
Act-R default rate of 0.5, which provides a learning speed almost as fast as the lower decay rates
(see Figure 4) but has a significantly lower error peak (see Figure 5). Very similar results can be
obtained by analyzing the influence of the threshold in activation that a chunk must reach to be
successfully retrieved (or equivalently the Response-to-Stimulus Interval of the experiment since
retrieval time is determined by activation). A low retrieval threshold, like a low decay rate, will
allow faster learning at the expense of higher error rates. A high retrieval threshold, while
preventing retrieval in the early stages of learning, will also yield fewer errors. Thus a threshold on
retrieval that requires a certain amount of practice to allow a memory to be recalled is not
necessarily a cognitive limitation but instead might serve the role of preventing interference between
memory items by delaying their recall until they are reliably established.

3.3 Activation Noise

Random noise in the activation quantities controlling the retrieval of memory chunks also
appears, like memory decay, to be purely a limitation of human memory without any redeeming
value. Indeed, it was introduced in ACT-R primarily to provide error profiles comparable to
those of human subjects. However, it turns out that stochasticity has advantageous consequences
as well. Figure 6 plots the reaction times and odds of guessing as a function of three activation
noise levels S of 0.1, 0.25 (our default used in many other simulations, e.g. Lebiere (1998,
1999) and Lebiere & West (1999)) and 0.5. The effect on speed of learning is mixed: while
lower noise levels produce faster performance, they also lead to lower odds of guessing, at least
initially. The reason for this divergence lies in the different ways in which activation affects
those measures of performance. Noise is symmetrically distributed around a mean of 0. Since
the retrieval time is a negative exponential function of activation, negative noise amounts will
slow down retrieval more than positive amounts will speed it up. This will lead to slower
retrievals on average for higher noise levels. As can be seen in the first part of Figure 6, the default
noise level of 0.25 is only slightly slower than the low noise level of 0.1, but the higher noise
level of 0.5 is markedly slower.
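
The asymmetry described above can be checked numerically. The following is a minimal sketch, not the model's actual code. It assumes that activation noise follows a zero-mean logistic distribution with scale S (the usual ACT-R convention, which this section does not restate) and that retrieval time is proportional to exp(-A). Under those assumptions, noise multiplies the mean retrieval time by E[exp(-noise)], which for 0 < S < 1 has the closed form πS / sin(πS). The function names and integration bounds are illustrative choices.

```python
import math


def logistic_pdf(x: float, s: float) -> float:
    """Density of zero-mean logistic noise with scale s (assumed noise model)."""
    return 1.0 / (4.0 * s * math.cosh(x / (2.0 * s)) ** 2)


def mean_slowdown(s: float, lo: float = -30.0, hi: float = 30.0,
                  steps: int = 60_000) -> float:
    """E[exp(-noise)]: the factor by which noise multiplies mean retrieval time.

    Evaluated with the trapezoid rule on [lo, hi]; the tails outside this
    interval are negligible for the noise levels considered here.
    """
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        x = lo + k * h
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * math.exp(-x) * logistic_pdf(x, s)
    return total * h


for s in (0.1, 0.25, 0.5):
    closed_form = math.pi * s / math.sin(math.pi * s)  # valid for 0 < s < 1
    print(f"S={s}: numeric={mean_slowdown(s):.4f}  closed form={closed_form:.4f}")
```

Under these assumptions the slowdown factor is roughly 1.02 for S = 0.1, 1.11 for S = 0.25 and 1.57 for S = 0.5, which agrees qualitatively with the ordering reported for the first part of Figure 6.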





Figure 6. Reaction times and odds of guessing for three activation noise levels (both panels plotted against training block).



The impact of noise on odds of retrieval, i.e. guessing the next stimulus’ position, will
depend on whether chunk activation is higher or lower than the retrieval threshold. If it is lower, as
it is early in the learning, activation noise can only help in reaching the threshold. Therefore, the
larger the amount of noise, the better the odds that the chunk can be retrieved. However,
consistent with the Odds Practice equation, the slope of the odds curve as a function of practice is
inversely proportional to the noise level S, and, after the first learning phase, the relationship is
inverted. The reason is that the average activation for the chunks encoding pairs of stimuli is
now higher than the threshold and adding noise can only increase the risk of retrieval failures.
But because it helps in retrieving the newer chunks that still have low activation, the total effect
of noise in the second learning phase is weak and its main consequence is the beneficial effect of
activation noise in speeding up the first learning phase.
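
The reversal around the threshold can be illustrated with a small sketch. It assumes the standard ACT-R retrieval probability, P = 1 / (1 + exp(-(A - τ) / S)), so that the odds of retrieval are exp((A - τ) / S). The activation and threshold values are hypothetical and chosen only to sit on either side of the threshold.

```python
import math


def retrieval_probability(activation: float, threshold: float, s: float) -> float:
    """Probability that activation plus logistic noise exceeds the threshold."""
    return 1.0 / (1.0 + math.exp(-(activation - threshold) / s))


THRESHOLD = 0.0  # hypothetical retrieval threshold
for label, activation in (("below threshold", -0.5), ("above threshold", 0.5)):
    row = ", ".join(
        f"S={s}: {retrieval_probability(activation, THRESHOLD, s):.3f}"
        for s in (0.1, 0.25, 0.5)
    )
    print(f"{label}: {row}")
```

Below the threshold the retrieval probability rises with S, and above it the probability falls with S, which is the inversion described in the text.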

With increased odds of guessing come increased odds of error, as demonstrated in the first 
part of Figure 7. Higher noise also means higher odds of error through almost all the training 
except the time of maximum error at about the 10th training block. However, as the second part
of Figure 7 establishes, those increased odds of error mostly reflect the increased odds of guess- 
ing. Only in the second part of the learning does higher noise lead to slightly higher odds of 
error for equal odds of guessing. 





Figure 7. Odds of error as a function of practice and of odds of guessing for 3 activation noise levels. 



Once again, a tradeoff emerges, this time between faster initial learning and the slightly higher error
rates resulting from higher noise levels. The standard noise value of 0.25 seems like a good
compromise, but the optimal value will depend on the relative importance of initial versus
long-term performance.* This in turn will depend upon the rate at which the environment changes,
which is the focus of the next section.

3.4 Sequence Change 

The results presented in the previous sections referred to a model that learned a sequence given a 
clean, empty memory without any existing knowledge to interfere with the new sequence. Real 
life, of course, is not always so convenient. Sequences in the real world, be it weather, speech or 
stock market data, always change, frequently without warning. This section analyzes the
performance of the ACT-R model when it is trained on one sequence that is then changed to a different



* A way out of this tradeoff would be to have a variable noise level, initially high to provide fast
learning, then gradually decreasing to bring down the error rate. Lebiere (1998, 1999) observed that
this gradual noise reduction with practice would also help account for long-term learning in
domains such as cognitive arithmetic and is related to the technique of simulated annealing used in
the Boltzmann Machine (Ackley, Hinton, & Sejnowski, 1985).
one. Specifically, the model learns the standard unique sequence a b c d for 10 training blocks,
then learns another unique but incompatible sequence, d c b a, for 20 training blocks. Only the 
performance in those latter 20 blocks is plotted in the following graphs. The results would have 
been qualitatively similar but quantitatively less distinguishable if other types of sequences were 
used because those sequences would have had small fragments in common that would have per- 
mitted some knowledge transfer from one sequence to the next. For example, two ambiguous de 
Bruijn sequences have in common all pairs of stimuli. 

As one would expect phenomena such as forgetting and stochasticity to impact the model’s
ability to switch from one sequence to the next, the sensitivity analysis carried out in the two
previous sections on memory decay rate and activation noise level is repeated here, with striking
results, as can be seen in Figure 8. While higher decay rates still make for slower performance in
terms of reaction time, the difference tends to disappear over time. Moreover, while higher decay
rates also yield lower odds of guessing, the shape of the curves is very different: odds of
guessing for lower decay rates start very high and go even higher before gradually decreasing,
while odds of guessing for higher decay rates start much lower and decrease before
gradually picking up. The reason for this behavior is that the model keeps retrieving the old
sequence when the new one is introduced, a process that for low enough decay rates is
self-sustaining and effectively blocks out the new sequence for many training blocks. For high decay
rates such as 0.75, forgetting is strong enough to quickly wipe out the old sequence, decreasing
the odds of guessing, before the new sequence is gradually learned.
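
How quickly an unrehearsed chunk fades under different decay rates can be sketched as follows. This assumes ACT-R's base-level learning equation, B = ln(Σ_j t_j^(-d)), where t_j is the time since the j-th reference and d is the decay rate. The reference schedule (one reference per block during blocks 1 to 10, none afterwards) is a hypothetical simplification, not the timing used in the simulations reported here.

```python
import math


def base_level_activation(reference_times, now, decay):
    """B = ln(sum over past references of (now - t_j) ** -decay)."""
    lags = [now - t for t in reference_times if t < now]
    if not lags:
        raise ValueError("chunk has no references before 'now'")
    return math.log(sum(lag ** -decay for lag in lags))


# Hypothetical schedule: the old-sequence chunk is referenced once per block
# during blocks 1..10 and never again after the sequence changes.
old_chunk_references = list(range(1, 11))

for decay in (0.25, 0.5, 0.75):
    trace = [base_level_activation(old_chunk_references, now, decay)
             for now in (11, 15, 30)]
    print(f"d={decay}: {[round(b, 2) for b in trace]}")
```

The sketch deliberately omits the reinforcement that each successful retrieval of the old sequence provides. In the full model that feedback is what makes the old sequence self-sustaining at low decay rates.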
Figure 8. Reaction times and odds of guessing for three memory decay rates.



Figure 9, which plots the odds of error, makes clear the downside of low decay rates in the pres- 
ence of a changing environment. The odds of error for low decay rates become very large because 
the model keeps anticipating the old sequence and only very gradually decline to about even at 
the end of the 20 training blocks. In contrast, for the high decay rate the odds of error start about 
even and are down to almost zero by the end of the 20 blocks. The second part of Figure 9 plots 
the odds of error as a function of guessing odds, displaying a very interesting trajectory in that 
state space. The vertical bifurcation in the graph suggests a phase transition for a given value of
the decay parameter. For lower decay rates, odds of guessing and error grow very large before
slowly decreasing, while for higher decay rates the model adopts the sensible strategy of mostly 
stopping guessing until the new sequence can be acquired to minimize errors. 
Figure 9. Odds of error as a function of practice and of odds of guessing for 3 memory decay rates.



A similar analysis for activation noise levels yields much the same results, reported in 
Figures 10 and 11.
Figure 10. Reaction times and odds of guessing for three activation noise levels.



Lebiere (1998, 1999) found a similar bifurcation for activation noise values, with values higher 
than a certain threshold leading in the limit to chaotic behavior while values lower than that 
threshold yielding convergence to a stable memory state. 
Figure 11. Odds of error as a function of practice and of odds of guessing for three noise levels.



If anything, low noise levels such as 0.1 yield even more undesirable extreme behavior
than low decay rates, yielding very high error rates and slow performance because the lack of
sufficient stochasticity prevents adapting to the new sequence. That is because memory
reinforcement is a dynamic process with its own feedback loop. Retrieving an item from memory
reinforces it, which in turn makes it more likely to be retrieved. Thus, if the system has
sufficient stochasticity, the new sequence, while weaker at first than the first one, can be
retrieved early by chance, which then lets it get established earlier. A very deterministic system,
on the other hand, takes much longer to let purely external reinforcement gradually push the
activation of the new sequence above that of the initial one, with the help of memory decay.
One can observe that the middle activation noise value of 0.25 yields fairly good results,
converging by the end of the 20 training blocks to the results for the best value of 0.5, while
performance for the lower noise value of 0.1 remains quite poor.

In conclusion, this analysis confirms and greatly reinforces the conclusion that forgetting and
stochasticity are not undesirable characteristics of a suboptimal cognitive system but are instead
fundamental tools in adapting to a changing environment. Moreover, the default values of the
parameters controlling those phenomena in ACT-R, namely memory decay rate and chunk
activation noise, might well be close to the optimum for a wide range of conditions.

3.5 Knowledge Representation 

A fundamental characteristic of our sequence learning model is the way it represents the
sequence by building chunk pairs first from basic stimuli then from other chunks, gradually
stretching to span the whole sequence. This representation is very flexible, allowing the
storage of sequences of arbitrary length. But a disadvantage is that the representation of any given
sequence is still of a finite length, and when the model reaches the end it fails to anticipate the
next stimulus and must start anew. An alternative representation would be to simply store the
sequence in chunks of a given fixed length by storing all subsequences of that length,
essentially achieving a tiling effect. For example, if all chunks are of length 2, i.e. pairs, one would
represent the unique sequence a b c d by storing the 4 chunks a-b, b-c, c-d and d-a.* The
model would operate essentially unchanged other than for the simplification of not needing to
build chunks more complex than pairs of stimuli. Figure 12 plots the learning provided by
that model for the three standard types of sequences. Generally, odds of guessing increase very
quickly and indefinitely according to the power law of the Odds Practice Equation, while
reaction times also decrease quickly to the minimum reflecting the fan (Anderson 1974) of each
sequence element. Ambiguous sequences are still more difficult than unique sequences for the
same reason as in the original model, namely that the stimuli appear in multiple positions, and
therefore in multiple chunks, thereby diluting the activation spreading from the context.





Figure 12. Reaction times and odds of guessing for the pair model on the 3 types of sequences (both panels plotted against training block).



However, while unique sequences are learned without errors, Figure 13 shows conclusively that
the error rates for mixed and ambiguous sequences gradually increase until reaching a plateau
determined by the inherent ambiguity of the sequence given this fixed representation. For
example, since ambiguous sequences have no stimulus that by itself identifies its successor
without ambiguity, they have the highest error rate.



* These two models essentially reflect the general choice between fixed and flexible windows which
also arises in other frameworks such as neural networks. The original model that gradually builds
its context corresponds to Elman’s (1990) recurrent networks while the fixed-length model
corresponds to networks with fixed-input fields such as NETtalk (Sejnowski & Rosenberg, 1987).
Figure 13. Odds of error as a function of practice and of odds of guessing for the pair model.



An obvious solution is to simply increase the length of the fixed chunks, say from pairs to
triplets. This effectively allows the model to learn all types of sequences without any errors, but
Figure 14 establishes that this perfection comes at a cost. While reaction times are still quite
fast, odds of guessing still increase according to a power law but much more slowly. For
example, at the end of the 20 training blocks, the odds of anticipating the next stimulus of an
ambiguous sequence are still less than even. This is a direct consequence of the increased
dilution of spreading activation resulting from the growth in the size of the chunks, leading each
stimulus to appear in more chunks and making them less predictive.
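
The fixed-length representations discussed in this section can be sketched in a few lines. The code below is an illustrative sketch rather than the model itself: it tiles a cyclic sequence into chunks of a given length and checks whether each context (a chunk minus its last element) identifies a single successor. The sequence a a b b is a hypothetical stand-in for an ambiguous sequence, and the function names are invented for this example.

```python
from collections import defaultdict


def tile(sequence, length):
    """All cyclic subsequences of the given length, one starting at each position."""
    n = len(sequence)
    return [tuple(sequence[(i + k) % n] for k in range(length)) for i in range(n)]


def successors(sequence, length):
    """Map each context (chunk minus its last element) to the stimuli following it."""
    table = defaultdict(set)
    for chunk in tile(sequence, length):
        table[chunk[:-1]].add(chunk[-1])
    return dict(table)


def is_unambiguous(sequence, length):
    """True if every context of this length predicts exactly one successor."""
    return all(len(nxt) == 1 for nxt in successors(sequence, length).values())


print(tile("abcd", 2))            # the four pairs a-b, b-c, c-d and d-a
print(is_unambiguous("abcd", 2))  # pairs suffice for the unique sequence
print(is_unambiguous("aabb", 2))  # pairs leave the ambiguous sequence ambiguous
print(is_unambiguous("aabb", 3))  # triplets resolve it
```

The sketch captures only the representational side of the tradeoff. The slower learning of triplets comes from the dilution of spreading activation, which is not modeled here.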




Figure 14. Reaction times and odds of guessing for the triplet model on the 3 types of sequences (x-axis: training block).
Thus, smaller chunks such as pairs cannot adequately represent ambiguous sequences while
larger chunks such as triplets are much too slow in building chunks. This suggests that, perhaps 
like human cognition, the original model, despite its flaws such as its tendency to make errors, 
might still represent the optimal solution for representing and accessing sequences. 

4. Discussion 

While the elements of the sequences learned by this model are discrete, many sequences encoun- 
tered in real life are composed of continuous quantities, for example the prices on the stock 
market. Fortunately, there is a straightforward way to extend the model to handle sequences of 
continuous quantities by using ACT-R’s partial-matching mechanism. While the sequence is still 
represented by symbolic chunks, the retrieval of chunks is affected by the similarities between the 
values that they hold. Specifically, the chunk with the highest match score is retrieved, where 
match score is a function of the chunk activation and its degree of mismatch to the desired val- 
ues: 



$$M_i = A_i - MP \cdot \sum_{v,d} \bigl(1 - Sim(v,d)\bigr) \qquad \text{(Partial Match Equation)}$$

where Sim(v,d) is the similarity between the desired value v held in the goal and the actual value
d held in the retrieved chunk, allowing the representation of continuous quantities. Thus even if
no chunk in memory perfectly matches the current context, a likely occurrence with an infinite
amount of continuous values, the chunk holding the closest value can be retrieved if its match
score after subtracting the mismatch between values from its activation is still higher than the
retrieval threshold. A shortcoming of partial matching is that there will always be a residual error
for the imperfection of the match. Lebiere (1999) proposed a generalization of the retrieval
mechanism called blending which allows the retrieval and averaging of values from multiple
chunks rather than a single one, providing for sequences of continuous values a powerful kind of
interpolation that has proved useful for a range of paradigms of implicit learning (Wallach &
Lebiere 1999; Gonzalez, Lebiere & Lerch 1999). Specifically, the value retrieved is:

$$V = \arg\min_{V} \sum_i P_i \cdot \bigl(1 - Sim(V, V_i)\bigr)^2 \qquad \text{(Blending Equation)}$$

where $P_i$ is the probability of retrieving chunk i and $V_i$ is the value held by that chunk.
Unfortunately, space limitations do not allow for a detailed analysis of the application of this model
extension to continuous sequences.
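
As a rough illustration of how partial matching and blending treat continuous values, the sketch below assumes a linear similarity function on a unit range and a blending criterion that minimizes the probability-weighted squared dissimilarity. Under those assumptions, and as long as no similarity is clipped at zero, the blended value is simply the probability-weighted mean. The similarity function, the mismatch penalty of 1.5 and the example values are hypothetical choices, not parameters reported in this chapter.

```python
def similarity(a, b, value_range=1.0):
    """Hypothetical linear similarity in [0, 1]; 1 means identical values."""
    return max(0.0, 1.0 - abs(a - b) / value_range)


def match_score(activation, desired, actual, mismatch_penalty=1.5):
    """Activation minus the penalty-weighted mismatch summed over slot values."""
    mismatch = sum(1.0 - similarity(v, d) for v, d in zip(desired, actual))
    return activation - mismatch_penalty * mismatch


def blend(values, probabilities):
    """Value minimizing sum_i P_i * (1 - Sim(V, V_i)) ** 2.

    With the linear similarity above (and no clipping) this is the
    probability-weighted mean of the stored values.
    """
    total = sum(probabilities)
    return sum(p * v for p, v in zip(probabilities, values)) / total


stored_values = [0.2, 0.4, 0.8]      # values held by three chunks
retrieval_probs = [0.5, 0.3, 0.2]    # their retrieval probabilities
print(match_score(1.0, desired=[0.5], actual=[0.7]))
print(blend(stored_values, retrieval_probs))
```

Blending returns a value that none of the chunks holds exactly, which is the interpolation effect described above. Partial matching alone would return the value of the single best-matching chunk.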

The ACT-R model also has the capacity to learn the “noisy” finite-state grammars of
Cleeremans and McClelland (1991). In that paradigm, continuous sequences are generated from a
finite-state grammar looping onto itself, with a small probability (e.g. 15%) of substituting at
each point in the sequence a random stimulus for the one prescribed by the grammar. Like the
SRN networks used by Cleeremans and McClelland, the chunks built by the ACT-R model can
span as large a segment of the sequence as necessary. Moreover, if several possible stimuli can
follow the current sequence fragment, they will be anticipated with a probability equal to their
conditional probability of occurrence in the sequence since the activations of a chunk represent the
odds of that chunk occurring in the environment. This correspondence between ACT-R and SRN
is to be expected since ACT-R uses Bayesian learning mechanisms and neural networks have been
shown to converge to Bayesian optimum under certain conditions (Mitchell, 1997).
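
The noisy-grammar paradigm can be sketched as follows. The grammar below is a small hypothetical example, not the one used by Cleeremans and McClelland, and the conditional probabilities are estimated from single-stimulus contexts only, whereas the model's chunks can span longer contexts. The function and variable names are illustrative.

```python
import random
from collections import Counter, defaultdict

# Hypothetical looping grammar: each state maps to a list of
# (emitted stimulus, next state) arcs, chosen with equal probability.
GRAMMAR = {
    0: [("a", 1), ("b", 2)],
    1: [("c", 2)],
    2: [("d", 0), ("a", 0)],
}
ALPHABET = "abcd"


def noisy_sequence(length, noise=0.15, seed=0):
    """Generate stimuli from the grammar, substituting a random one with probability `noise`."""
    rng = random.Random(seed)
    state, stimuli = 0, []
    for _ in range(length):
        stimulus, state = rng.choice(GRAMMAR[state])
        if rng.random() < noise:
            # Random substitution; the grammar state still advances as prescribed.
            stimulus = rng.choice(ALPHABET)
        stimuli.append(stimulus)
    return stimuli


def conditional_probabilities(sequence):
    """Estimate P(next stimulus | current stimulus) from pair counts."""
    counts = defaultdict(Counter)
    for current, following in zip(sequence, sequence[1:]):
        counts[current][following] += 1
    return {
        current: {nxt: n / sum(c.values()) for nxt, n in c.items()}
        for current, c in counts.items()
    }


sequence = noisy_sequence(5000)
for context, distribution in sorted(conditional_probabilities(sequence).items()):
    print(context, {k: round(v, 2) for k, v in sorted(distribution.items())})
```

A learner that matches these conditional probabilities would anticipate each possible successor in proportion to how often it follows the current context, which is the behavior attributed to the model above.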

Perhaps the main advantage of a cognitive architecture such as ACT-R is its generality,
specifically the ability to apply a model, if not literally then in the form of design patterns and
parameter values, to a very different task than the one for which it was developed. For example,
Lebiere and West (1999) applied this model of sequence learning with great success to the classic
game of Paper Rock Scissors. Interestingly, for reasons of comparability with an existing neural
network model, they chose the fixed-length chunk representation described in the previous
section. They concluded that the default parameters used by the model presented here provided
excellent performance that matched human behavior quite well, confirming the analysis presented
here. Moreover, the pair model provided the best fit to the initial performance of human subjects
while the triplet model provided the best fit to their long-term performance, supporting the
original variable-length representation, which starts with pairs of stimuli and gradually builds longer
sequences upon them. Another related model is the model of list learning of Anderson et al.
(1997). Like the sequence learning model, it acquires a list of items by building chunks
associating indices in the list to the items stored at that position. The indices have two components,
one being the identity of the subgroup holding the item and the other the position of the item
in that subgroup. Clearly there is a parallel between those indices and the context elements of
chunks in this model that should be pursued further. Finally, Lebiere (1998; 1999) performed a
sensitivity analysis of a model of lifetime learning of cognitive arithmetic that yielded many of
the same conclusions as this analysis, including the desirability of characteristics such as
forgetting, stochasticity and retrieval failure to learning even in as precise an environment as formal
mathematics. One can therefore conjecture that the lessons drawn from the analysis presented
here are applicable not only to this particular domain but to many other cognitive tasks as well.

5. Conclusion 

This chapter presents a model of sequence learning based on the ACT-R cognitive architecture.
This model acquires sequences by gradually learning small pieces of the sequence in symbolic 
structures called chunks. The availability of those chunks is a function of their numerical pa- 
rameters, which are estimated by the architecture’s Bayesian learning mechanisms to reflect the 
structure of the environment. Therefore, the sequence is represented both explicitly in the sym- 
bolic chunks as well as implicitly in their real-valued parameters. In a recent review, Cleeremans 
et al. (1998, p. 414) identify the following properties for successful learning models in the field: 

■ Learning involves elementary association processes that are highly sensitive to the statisti- 
cal features of the training set. 

■ Learning is incremental, continuous, and best characterized as a by-product of ongoing 
processing. 

■ Learning is based on the processing of exemplars and produces distributed knowledge. 

■ Learning is unsupervised and self-organizing. 
The ACT-R approach to sequential pattern acquisition seems to be well characterized by the
properties listed above: the model learns to encode chunks (i.e. exemplars) from basic stimuli 
that gradually stretch to span the whole sequence. Learning of elementary associations between 
subsequent objects is conceptualized by the model as a non-deliberate by-product of performance 
that is highly sensitive to the statistical features of the particular training sequence. Learning is 
unsupervised in the sense that the model essentially starts with the encoding of instructions on 
the respective stimulus-key mappings and incrementally builds up explicit and implicit knowl- 
edge (i.e. chunks and their subsymbolic association values) to prepare responses in the serial 
reaction task. 

This chapter presented a detailed analysis of the model’s sensitivity to its parameters and ex- 
perimental conditions. The main conclusion is that two central characteristics of cognitive models, 
stochasticity and forgetting, are in fact essential in optimizing the performance of the model. We also 
studied alternative knowledge representations and found them to have advantages but also serious 
shortcomings relative to the standard model. This model is related to a number of other ACT-R
models applied to a wide range of domains, and it is hoped that both the techniques used here and the 
conclusions drawn are applicable to a wide range of cognitive domains. 
1 Introduction

The most challenging open issues in sequential decision making include partial 
observability of the decision maker’s environment, hierarchical and other types 
of abstract credit assignment, the learning of credit assignment algorithms, and 
exploration without a priori world models. I will summarize why direct search 
(DS) in policy space provides a more natural framework for addressing these 
issues than reinforcement learning (RL) based on value functions and dynamic 
programming. Then I will point out fundamental drawbacks of traditional DS 
methods in case of stochastic environments, stochastic policies, and unknown 
temporal delays between actions and observable effects. I will discuss a remedy 
called the success-story algorithm, show how it can outperform traditional DS, 
and mention a relationship to market models combining certain aspects of DS 
and traditional RL. 

Policy learning. A learner’s modifiable parameters that determine its be- 
havior are called its policy. An algorithm that modifies the policy is called a 
learning algorithm. In the context of sequential decision making based on rein- 
forcement learning (RL) there are two broad classes of learning algorithms: (1) 
methods based on dynamic programming (DP) (Bellman, 1961), and (2) direct 
search (DS) in policy space. DP-based RL (DPRL) learns a value function map- 
ping input/action pairs to expected discounted future reward and uses online 
variants of DP for constructing rewarding policies (Samuel, 1959; Barto, Sut- 
ton, & Anderson, 1983; Sutton, 1988; Watkins, 1989; Watkins & Dayan, 1992; 
Moore & Atkeson, 1993; Bertsekas & Tsitsiklis, 1996). DS runs and evaluates 
policies directly, possibly building new policy candidates from those with the 
highest evaluations observed so far. DS methods include variants of stochastic 
hill-climbing (SHC), evolutionary strategies (Rechenberg, 1971; Schwefel, 1974), 
genetic algorithms (GAs) (Holland, 1975), genetic programming (GP) (Cramer,
1985; Banzhaf, Nordin, Keller, & Francone, 1998), Levin Search (Levin, 1973, 
1984), and adaptive extensions of Levin Search (Solomonoff, 1986; Schmidhuber, 
Zhao, & Wiering, 1997b). 
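
To make the DS idea concrete, here is a minimal sketch of stochastic hill-climbing in policy space. It treats the policy as a plain parameter vector and keeps a mutated candidate only when it evaluates higher than the best policy found so far. The `evaluate` function, its target vector, and the step size are hypothetical stand-ins for running the policy in an environment and measuring its reward.

```python
import random


def evaluate(policy):
    """Hypothetical deterministic evaluation: higher is better, 0 is optimal."""
    target = (0.3, -0.7, 0.5)  # illustrative optimum, unknown to the searcher
    return -sum((p - t) ** 2 for p, t in zip(policy, target))


def stochastic_hill_climbing(evaluate, policy, steps=2000, step_size=0.1, seed=0):
    """Direct search: mutate the best policy so far, keep strict improvements."""
    rng = random.Random(seed)
    best, best_score = list(policy), evaluate(policy)
    for _ in range(steps):
        candidate = [p + rng.gauss(0.0, step_size) for p in best]
        score = evaluate(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score


policy, score = stochastic_hill_climbing(evaluate, [0.0, 0.0, 0.0])
print([round(p, 2) for p in policy], round(score, 4))
```

No value function or state model appears anywhere in the loop. The evaluation here is deterministic, so a single trial per candidate is enough; that convenience does not hold in stochastic environments.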

Outline. DS offers several advantages over DPRL, but also has some draw- 
backs. I will list advantages first (section 2), then describe an illustrative task 
unsolvable by DPRL but trivially solvable by DS (section 3), then mention a 
few theoretical results concerning DS in general search spaces (section 4), then 
point out a major problem of DS (section 5), and offer a remedy (section 6 and 
section 7). 
2 Advantages of Direct Search

2.1 DS Advantage 1: No States 

Finite time convergence proofs for DPRL (Kearns & Singh, 1999) require (among 
other things) that the environment can be quantized into a finite number of di- 
screte states, and that the topology describing possible transitions from one state 
to the next, given a particular action, is known in advance. Even if the real world 
were quantizable into a discrete state space, however, for all practical purposes
this space will be inaccessible and remain unknown. Current proofs do not cover 
apparently minor deviations from the basic principle, such as the world-class 
RL backgammon player (Tesauro, 1994), which uses a nonlinear function appro- 
ximator to deal with a large but finite number of discrete states and, for the 
moment at least, seems a bit like a miracle without full theoretical foundation. 
Prior knowledge about the topology of a network connecting discrete states is 
also required by algorithms for partially observable Markov decision processes
(POMDPs), although they are more powerful than standard DPRL, e.g., (Kael- 
bling, Littman, & Cassandra, 1995; Littman, Cassandra, & Kaelbling, 1995). In 
general, however, we do not know a priori how to quantize a given environment 
into meaningful states. 

DS, however, completely avoids the issues of value functions and state iden- 
tification — it just cares for testing policies and keeping those that work best. 

2.2 DS Advantage 2: No Markovian Restrictions 

Convergence proofs for DPRL also require that the learner’s current input con- 
veys all the information about the current state (or at least about the optimal 
next action). In the real world, however, the current sensory input typically tells 
next to nothing about the “current state of the world,” if there is such a thing 
at all. Typically, memory of previous events is required to disambiguate inputs. 
For instance, as your eyes are sequentially scanning the visual scene dominated 
by this text you continually decide which parts (or possibly compressed descrip- 
tions thereof) deserve to be represented in short-term memory. And you have 
presumably learned to do this, apparently by some unknown, sophisticated RL 
method fundamentally different from DPRL. 

Some DPRL variants such as Q(λ) are limited to a very special kind of exponentially
decaying short-term memory. Others simply ignore memory issues
by focusing on suboptimal, memory-free solutions to problems whose optimal 
solutions do require some form of short-term memory (Jaakkola, Singh, & Jor- 
dan, 1995). Again others can in principle find optimal solutions even in partially 
observable environments (POEs) (Kaelbling et al., 1995; Littman et al., 1995),
but they (a) are practically limited to very small problems (Littman, 1996), and 
(b) do require knowledge of a discrete state space model of the environment. To 
various degrees, problem (b) also holds for certain hierarchical RL approaches to 
memory-based input disambiguation (Ring, 1991, 1993, 1994; McCallum, 1996; 
Wiering & Schmidhuber, 1998). Although no discrete models are necessary for 
DPRL systems with function approximators based on recurrent neural networks
(Schmidhuber, 1991c; Lin, 1993), the latter do suffer from a lack of theoretical 
foundation, perhaps even more so than the backgammon player. 

DS, however, does not care at all for Markovian conditions and full observa- 
bility of the environment. While DPRL is essentially limited to learning reactive 
policies mapping current inputs to output actions, DS in principle can be ap- 
plied to search spaces whose elements are general algorithms or programs with 
time-varying variables that can be used for memory purposes (Williams, 1992; 
Teller, 1994; Schmidhuber, 1995; Wiering & Schmidhuber, 1996; Salustowicz & 
Schmidhuber, 1997). 

2.3 DS Advantage 3: Straight-Forward Hierarchical Credit 
Assignment 

There has been a lot of recent work on hierarchical DPRL. Some researchers ad- 
dress the case where an external teacher provides intermediate subgoals and/or 
prewired macro actions consisting of sequences of lower-level actions (Moore & 
Atkeson, 1993; Tham, 1995; Sutton, 1995; Singh, 1992; Humphrys, 1996; Digney, 
1996; Sutton, Singh, Precup, & Ravindran, 1999). Others focus on the more am- 
bitious goal of automatically learning useful subgoals and macros (Schmidhuber, 
1991b; Eldracher & Baginski, 1993; Ring, 1991, 1994; Dayan & Hinton, 1993; 
Wiering & Schmidhuber, 1998; Sun & Sessions, 2000). Compare also work pre- 
sented at the recent NIPS*98 workshop on hierarchical RL organized by Doina 
Precup and Ron Parr (McGovern, 1998; Andre, 1998; Moore, Baird, & Kaelb- 
ling, 1998; Bowling & Veloso, 1998; Harada & Russell, 1998; Wang & Mahadevan, 
1998; Kirchner, 1998; Coelho & Grupen, 1998; Huber & Grupen, 1998). 

Most current work in hierarchical DPRL aims at speeding up credit assign- 
ment in fully observable environments. Approaches like HQ-learning (Wiering & 
Schmidhuber, 1998), however, additionally achieve a qualitative (as opposed to 
just quantitative) decomposition by learning to decompose problems that cannot 
be solved at all by standard DPRL into several DPRL-solvable subproblems and 
the corresponding macro-actions. 

Generally speaking, non-trivial forms of hierarchical RL almost automatically 
run into problems of partial observability, even those with origins in the MDP 
framework. Feudal RL (Dayan & Hinton, 1993), for instance, is subject to such 
problems (Ron Williams, personal communication). As Peter Dayan himself puts 
it (personal communication): “Higher level experts are intended to be explicitly 
ignorant of the details of the state of the agent at any resolution more detailed 
than their action choice. Therefore, the problem is really a POMDP from their 
perspective. It’s easy to design unfriendly state decompositions that make this 
disastrous. The key point is that it is highly desirable to deny them information 
- the chief executive of [a major bank] doesn’t really want to know how many 
paper clips his most junior bank clerk has - but arranging for this to be benign 
in general is difficult.”

In the DS framework, however, hierarchical credit assignment via frequently 
used, automatically generated subprograms becomes trivial in principle. For instance,
suppose policies are programs built from a general programming language
that permits parameterized conditional jumps to arbitrary code addresses
(Dickmanns, Schmidhuber, & Winklhofer, 1987; Ray, 1992; Wiering & Schmidhuber,
1996; Schmidhuber et al., 1997b; Schmidhuber, Zhao, & Schraudolph,
1997a). DS will simply keep successful hierarchical policies that partially reuse 
code (subprograms) via appropriate jumps. Again, partial observability is not 
an issue. 



2.4 DS Advantage 4: Non-Hierarchical Abstract Credit Assignment 

Hierarchical learning of macros and reusable subprograms is of interest but li- 
mited. Often there are non-hierarchical (nevertheless exploitable) regularities in 
solution space. For instance, suppose we can obtain solution B by replacing every 
action “turn(right)” in solution A by “turn(left).” B will then be regular in the
sense that it conveys little additional conditional algorithmic information, given 
A (Solomonoff, 1964; Kolmogorov, 1965; Chaitin, 1969; Li & Vitanyi, 1993), 
that is, there is a short algorithm computing B from A. Hence B should not be 
hard to learn by a smart RL system that already found A. While DPRL cannot 
exploit such regularities in any obvious manner, DS in general algorithm spaces 
does not encounter any fundamental problems in this context. For instance, all 
that is necessary to find B may be a modification of the parameter “right” of a 
single instruction “turn(right)” in a repetitive loop computing A (Schmidhuber 
et al., 1997b).
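
The turn(right) example can be made concrete with a toy program representation. In this sketch a solution is a loop body of (instruction, parameter) pairs repeated a fixed number of times, and solution B is obtained from solution A by changing the parameter of its single turn instruction. The representation and the "move" instruction are hypothetical and serve only to show how short the description of B is once A is known.

```python
def run(program):
    """Expand a tiny looped program into its flat action sequence."""
    actions = []
    for _ in range(program["repeat"]):
        for name, arg in program["body"]:
            actions.append(f"{name}({arg})")
    return actions


def with_turn_direction(program, direction):
    """Copy of the program with the parameter of its 'turn' instruction replaced."""
    body = [(name, direction if name == "turn" else arg)
            for name, arg in program["body"]]
    return {"repeat": program["repeat"], "body": body}


solution_a = {"repeat": 3, "body": [("move", "forward"), ("turn", "right")]}
solution_b = with_turn_direction(solution_a, "left")  # one-parameter change

print(run(solution_a))
print(run(solution_b))
```

A search over programs of this kind can reach B from A in a single mutation, whereas the two flat action sequences differ in every turn.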



2.5 DS Advantage 5: Metalearning Potential 

In a given environment, which is the best way of collecting reward? Hierarchical 
RL? Some sort of POMDP-RL, or perhaps analogy-based RL? Combinations 
thereof? Or other nameless approaches to exploiting algorithmic regularities in 
solution space? A smart learner should find out by itself, using experience to 
improve its own credit assignment strategy (metalearning or “learning to learn” ) 
(Lenat, 1983; Schmidhuber, 1987). In principle, such a learner should be able 
to run arbitrary credit assignment strategies, and discover and use “good” ones, 
without wasting too much of its limited life-time (Schmidhuber et al., 1997a). It
seems obvious that DPRL does not provide a useful basis for achieving this goal, 
while DS seems more promising as it does allow for searching spaces populated 
with arbitrary algorithms, including metalearning algorithms. I will come back 
to this issue later. 

Disclaimer: of course, solutions to almost all possible problems are irregu- 
lar and do not share mutual algorithmic information (Kolmogorov, 1965; So- 
lomonoff, 1964; Chaitin, 1969; Li & Vitanyi, 1993). In general, learning and 
generalization are therefore impossible for any algorithm. But it’s the comparatively
few, exceptional, low-complexity problems that receive almost all attention
of computer scientists.



2.6 DS Advantage 6: Exploring Limited Spatio-Temporal Predictability



Knowledge of the world may boost performance. That’s why exploration is a 
major RL research issue. How should a learner explore a spatio-temporal do- 
main? By predicting and learning from success/failure what’s predictable and 
what’s not. 

Most previous work on exploring unknown data sets has focused on selec- 
ting single training exemplars maximizing traditional information gain (Fedorov, 
1972; Hwang, Choi, Oh, & II, 1991; MacKay, 1992; Plutowski, Cottrell, & White, 
1994; Cohn, 1994). Here typically the concept of a surprise is defined in Shan- 
non’s sense (Shannon, 1948): some event’s surprise value or information content 
is the negative logarithm of its probability. This inspired simple reinforcement 
learning approaches to pure exploration (Schmidhuber, 1991a; Storck, Hochrei- 
ter, & Schmidhuber, 1995; Thrun & Moller, 1992) that use adaptive predictors 
to predict the entire next input, given current input and action. The basic idea 
is that the action-generating module gets rewarded in case of predictor failures. 
Hence it is motivated to generate action sequences leading to yet unpredictable 
states that are “informative” in the classic sense. Some of these explorers ac- 
tually like white noise simply because it is so unpredictable, thus conveying a lot 
of Shannon information. Compare alternative, hardwired exploration strategies 
(Sutton & Pinette, 1985; Kaelbling, 1993; Dayan & Sejnowski, 1996; Koenig & 
Simmons, 1996). 
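
The surprise measure just described can be written down directly. The sketch below is only an illustration: the function names, and the choice of using the surprise value itself as the exploration reward, are assumptions made here and not details of the cited systems.

```python
import math


def surprise_bits(probability):
    """Shannon surprise (information content) in bits of an event with this probability."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)


def exploration_reward(predicted_probability_of_observation):
    """Hypothetical reward: the worse the predictor did, the larger the reward."""
    return surprise_bits(predicted_probability_of_observation)


fair_coin = surprise_bits(0.5)
white_noise_byte = surprise_bits(1.0 / 256.0)
perfectly_predicted = surprise_bits(1.0)
```

Under this measure an input that is uniformly random over 256 values yields 8 bits of surprise at every step, no matter how long the learner observes it, which is why explorers of this type are attracted to white noise.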

Most existing systems either always predict all details of the next input or 
are limited to picking out simple statistical regularities such as “performing action
A in discrete, fully observable environmental state B will lead to state C with 
probability 0.8.” They are not able to limit their predictions solely to certain 
computable aspects of inputs (Schmidhuber & Prelinger, 1993) or input sequen- 
ces, while ignoring random and irregular aspects. For instance, they cannot even 
express (and therefore cannot find) complex, abstract, predictable regularities 
such as “executing a particular sequence of eye movements, given a history of 
incomplete environmental inputs partially caused by a falling glass of red wine, 
will result in the view of a red stain on the carpet within the next 3 seconds, 
where details of the shape of the stain are expected to be unpredictable and left 
unspecified.”

General spatio-temporal abstractions and limited predictions of this kind 
apparently can be made only by systems that can run fairly general algorithms 
mapping input/action sequences to compact internal representations conveying 
only certain relevant information embedded in the original inputs. For instance, 
there are many different, realistic, plausible red stains — all may be mapped 
onto the same compact internal representation predictable from all sequences 
compatible with the abstraction “falling glass.” If the final input sequence caused 
by eye movements scanning the carpet does not map onto the concept “red 
stain” (because the glass somehow decelerated in time and for some strange 
reason never touched the ground), there will be a surprise. There won’t be a 
surprise, however, if the stain exhibits a particular, unexpected, irregular shape,
because there was no explicit confident expectation of a particular shape in the
first place.

The central questions are: In a given environment, in absence of a state model, 
how to extract the predictable concepts corresponding to algorithmic regularities 
that are not already known? How to discover novel spatio-temporal regularities 
automatically among the many random or unpredictable things that should be 
ignored? Which novel input sequence-transforming algorithms do indeed compute
reduced internal representations permitting reliable predictions?

Usually we cannot rely on a teacher telling the system which concepts are 
interesting, such as in the EURISKO system (Lenat, 1983). The DPRL frame- 
work is out of the question due to issues of partial observability. Lookup-table 
approaches like those used in more limited scenarios are infeasible due to the 
huge number of potentially interesting sequence-processing, event-memorizing 
algorithms. On the other hand, it is possible to use DS-like methods for buil- 
ding a “curious” embedded agent that differs from previous explorers in the 
sense that it can limit its predictions to fairly arbitrary, computable aspects of 
event sequences and thus can explicitly ignore almost arbitrary unpredictable, 
random aspects (Schmidhuber, 1999). It constructs initially random algorithms 
mapping event sequences to abstract internal representations (IRs). It also con- 
structs algorithms predicting IRs from IRs computed earlier. Its goal is to learn 
novel algorithms creating IRs useful for correct IR predictions, without wasting 
time on those learned before. This can be achieved by a co-evolutionary scheme 
involving two competing modules collectively designing single algorithms to be 
executed. The modules have actions for betting on the outcome of IR predic- 
tions computed by the algorithms they have agreed upon. If their opinions differ 
then the system checks who’s right, punishes the loser (the surprised one), and 
rewards the winner. A DS-like RL algorithm forces each module to increase its 
reward. This motivates each to lure the other into agreeing upon algorithms in- 
volving predictions that surprise it. Since each module essentially can put in its 
veto against algorithms it does not consider profitable, the system is motivated 
to focus on those computable aspects of the environment where both modules 
still have confident but different opinions. Once both share the same opinion on 
a particular issue (via the loser’s DS-based learning process, e.g., the winner is 
simply copied onto the loser), the winner loses a source of reward — an incen- 
tive to shift the focus of interest onto novel, yet unknown algorithms. There are 
simulations where surprise-generation of this kind can actually help to speed up 
external reward (Schmidhuber, 1999). 

2.7 Summary 

Given the potential DS advantages listed above (most of them related to partial 
observability), it may seem that the more ambitious the goals of some RL resear- 
cher, the more he/she will get drawn towards methods for DS in spaces of fairly 
general algorithms, as opposed to the more limited DPRL-based approaches. 

Standard DS does suffer from major disadvantages, though, as I will point 
out later for the case of realistic, stochastic worlds.



3 Typical Applications

Both DPRL and DS boast impressive practical successes. For instance, there is 
the world-class DPRL-based backgammon player (Tesauro, 1994), although it’s 
not quite clear yet (beyond mere intuition) why exactly it works so well. And DS 
has proven superior to alternative traditional methods in engineering domains 
such as wing design, combustion chamber design, turbulence control (the histori- 
cal origins of “evolutionary computation”). There are overviews (Schwefel, 1995; 
Koumoutsakos P. & D., 1998) with numerous references to earlier work. 

Will DS’ successes in such domains eventually carry over to learning sequen- 
tial behavior in domains traditionally approached by variants of DPRL? At the 
moment little work has been done on DS in search spaces whose elements are 
sequential behaviors (Schmidhuber et al., 1997b), but this may change soon. Of
course, the most obvious temporal tasks to be attacked by DS are not board 
games but tasks that violate DPRL’s Markovian assumptions. It does not make 
sense to apply low-bias methods like DS to domains that satisfy the preconditi- 
ons of more appropriately biased DPRL approaches. 

Parity. I will use the parity problem to illustrate this. The task requires
separating bitstrings of length n > 0 (n integer) with an odd number of zeros from
others. In principle, n-bit parity is solvable by a 3-layer feedforward neural net
with n input units. But learning the task from training exemplars by, say, back- 
propagation, is hard for n > 20, due to such a net’s numerous free parameters. 
On the other hand, a very simple finite state automaton with just one bit of in- 
ternal state can correctly classify arbitrary bitstrings by sequentially processing 
them one bit at a time, and switching the internal state bit on or off depending 
on whether the current input is 1 or 0. 
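
The one-bit automaton can be sketched in a few lines of Python. The function name is hypothetical, and the choice to flip the state bit on each 0 (rather than on each 1) is an assumption made here to match the task statement above, which counts zeros.

```python
def has_odd_number_of_zeros(bits):
    """Classify a bitstring with a finite state automaton holding one bit of state."""
    state = 0
    for bit in bits:
        if bit == 0:
            state ^= 1  # flip the single internal state bit
    return state == 1


example_odd = has_odd_number_of_zeros([1, 0, 1])
example_even = has_odd_number_of_zeros([0, 1, 0, 1])
```

The same loop works for bitstrings of any length, since memory use does not grow with n.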

A policy implementing such a sequential solution, however, cannot efficiently 
be learned by DPRL. The problem is that the task violates DPRL’s essential 
Markovian precondition: the current input bit in a training sequence does not 
provide the relevant information about the previous history necessary for correct 
classification. 

Next we will see, however, that parity can be quickly learned by the most 
trivial DS method, namely, random search (RS). RS works as follows: REPEAT 
randomly initialize the policy and test the resulting net on a training set UNTIL 
solution found. 
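
The REPEAT ... UNTIL loop above can be sketched generically as follows. The function and parameter names are hypothetical, a trial limit is added so that the loop always terminates, and the toy "training set" test is a stand-in chosen only to make the example self-contained; it is not the parity experiment.

```python
import random


def random_search(sample_policy, solves_training_set, max_trials, rng):
    """Draw random policies until one passes the training-set test.

    Returns (policy, number_of_trials), or (None, max_trials) on failure.
    """
    for trial in range(1, max_trials + 1):
        policy = sample_policy(rng)
        if solves_training_set(policy):
            return policy, trial
    return None, max_trials


# Toy stand-in problem: a "policy" is a pair of weights in [-100, 100];
# it counts as a solution if the first is positive and the second negative.
def sample_weights(rng):
    return [rng.uniform(-100.0, 100.0) for _ in range(2)]


def toy_test(policy):
    return policy[0] > 0.0 > policy[1]


found_policy, trials_used = random_search(sample_weights, toy_test, 1000, random.Random(0))
```

Nothing is carried over from one trial to the next, which is what makes RS the most trivial member of the DS family.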

Experiment. Our policy is the weight matrix of a standard recurrent neural 
network. We use two architectures (A1, A2). A1(k) is a fully connected net with 1
input, 1 output, and k hidden units, each non-input unit receiving the traditional
bias connection from a unit with constant activation 1.0. A2 is like A1(10), but
less densely connected: each hidden unit receives connections from the input unit,
the output unit, and itself; the output unit sees all other units. All activation
functions are standard: $f(x) = (1 + e^{-x})^{-1} \in [0.0, 1.0]$. Binary inputs are -1.0
(for 0) and 1.0 (for 1). Sequence lengths n are randomly chosen between 500 
and 600. All variable activations are set to 0 at the beginning of each sequence. Target
information is provided only at sequence ends (hence the relevant time delays 
comprise at least 500 steps; there are no intermediate rewards). Our training set
consists of 100 sequences, 50 from class 1 (even; target 0.0) and 50 from class
2 (odd; target 1.0). Correct sequence classification is defined as “absolute error 
at sequence end below 0.1”. We stop RS once a random weight matrix (weights 
randomly initialized in [-100.0,100.0]) correctly classifies all training sequences. 
Then we test on the test set (100 sequences). 
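
A sketch of how a single candidate of architecture A1(k) might be evaluated is given below. It is not the original experimental code: the weight layout (one row per non-input unit, with columns for input, bias, and all non-input units), the synchronous update, and all function names are assumptions. Only the logistic activation, the -1.0/1.0 input coding, the zero initial activations, the weight range, and the 0.1 error threshold are taken from the text.

```python
import math
import random


def logistic(x):
    """f(x) = 1 / (1 + exp(-x)), written so that large |x| cannot overflow."""
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def random_weights(k, rng, scale=100.0):
    """One weight row per non-input unit (k hidden units plus 1 output unit)."""
    n_units = k + 1
    return [[rng.uniform(-scale, scale) for _ in range(n_units + 2)]
            for _ in range(n_units)]


def run_net(weights, bits):
    """Process one bit per time step and return the output unit's final activation."""
    activations = [0.0] * len(weights)  # reset at the beginning of each sequence
    for bit in bits:
        sources = [1.0 if bit else -1.0, 1.0] + activations  # input, bias, recurrent
        activations = [logistic(sum(w * s for w, s in zip(row, sources)))
                       for row in weights]
    return activations[-1]  # the last unit is taken to be the output unit


def classified_correctly(weights, bits, target):
    """Absolute error at sequence end must be below 0.1."""
    return abs(run_net(weights, bits) - target) < 0.1


demo_weights = random_weights(1, random.Random(0))
demo_output = run_net(demo_weights, [1, 0, 1, 1, 0])
```

RS would call `random_weights` repeatedly and keep the first weight matrix for which `classified_correctly` holds on every training sequence.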

In all simulations, RS eventually finds a solution that classifies all test set 
sequences correctly; average final absolute test set errors are always below 0.001 
(in most cases below 0.0001). In particular, RS with A1 (k = 1) solves the
problem within only 2906 trials (average of 10 trials). RS with A2 solves it within 
2797 trials on average. RS for architecture A2, but without self-connections for 
hidden units, solves the problem within 250 trials on average. See a previous 
paper (Hochreiter & Schmidhuber, 1997) for additional results in this vein. 

RS is a dumb DS algorithm, of course. It won’t work within acceptable time 
except for the most trivial problems. But this is beside the point of this section,
whose purpose is to demonstrate that even primitive DS may yield results beyond 
DPRL’s abilities. Later, however, I will discuss smarter DS methods. 

What about GP? Given the RS results above, how can it be that parity 
is considered a difficult problem by many authors publishing in the DS-based 
field of “Genetic Programming” (GP), as can be seen by browsing through the 
proceedings of recent GP conferences? The reason is: most existing GP systems 
are extremely limited because they search in spaces of programs that do not 
even allow for loops or recursion; to the best of my knowledge, the first exception
was (Dickmanns et al., 1987). Hence most GP approaches ignore a major
motivation for search in program space, namely, the repetitive reuse of code in
solutions with low algorithmic complexity (Kolmogorov, 1965; Solomonoff, 1964;
Chaitin, 1969; Li & Vitanyi, 1993).

Of course, all we need to make parity easy is a search space of programs 
that process inputs sequentially and allow for internal memory and loops or 
conditional jumps. A few thousand trials will suffice to generalize perfectly to 
n-bit parity for arbitrary n, not just for special values like those used in the 
numerous GP papers on this topic (where typically n << 30). 



4 DS Theory 

The arguments above emphasize DS’ ability to deal with general algorithm spaces 
as opposed to DPRL’s limited spaces of reactive mappings. Which theoretical 
results apply to this case? 

Non-incremental & Deterministic. We know that Levin Search (LS) 
(Levin, 1973, 1984; Li & Vitanyi, 1993) is optimal in deterministic and nonin- 
cremental settings, that is, in cases where during the search process there are no 
intermediate reinforcement signals indicating the quality of suboptimal solutions. 
LS generates and tests computable solution candidates s in order of their Levin 
complexities 

$$Kt(s) = \min_q \{ -\log D_P(q) + \log t(q, s) \},$$

where program q computes s in t(q, s) time steps, and $D_P(q)$ is q's Solomonoff-
Levin probability (Levin, 1973, 1984; Li & Vitanyi, 1993). Now suppose some
algorithm A is designed to find the x solving $\phi(x) = y$, where $\phi$ is a computable
function mapping bitstrings to bitstrings. For instance, x may represent a
solution to a maze task implemented by $\phi(x) = 1$ iff x leads to the goal. Let n
be a measure of problem size (such as the number of fields in the maze), and
suppose A needs at most O(f(n)) steps (f computable) to solve problems of
size n. Then LS also will need at most O(f(n)) steps (Levin, 1973, 1984; Li &
Vitanyi, 1993). Unlike GP etc., LS has a principled way of dealing with unknown 
program runtimes — time is allocated in an optimal fashion. There are systems 
that use a probabilistic LS variant to discover neural nets that perfectly gene- 
ralize from extremely few training examples (Schmidhuber, 1995, 1997), and LS 
implementations that learn to use memory for solving maze tasks unsolvable by 
DPRL due to highly ambiguous inputs (Schmidhuber et al., 1997b).
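
The time allocation behind LS can be illustrated with a toy sketch. Everything specific in it is an assumption made for illustration: programs are binary tuples, the probability of a program of length l is taken to be 2^(-l) (a simplification, not a proper prefix-free distribution), and the "interpreter" is a stand-in in which a program simply encodes an integer and needs that many steps to output it. In phase i, a program of length l is run for at most 2^(i-l) steps, so candidates are tested roughly in order of the sum of length and log runtime.

```python
from itertools import product


def levin_search(run, is_solution, max_phase, alphabet=(0, 1)):
    """Toy Levin search: in phase i, run each program of length l for 2**(i - l) steps."""
    for phase in range(1, max_phase + 1):
        for length in range(1, phase + 1):
            budget = 2 ** (phase - length)  # time share proportional to 2**(-length)
            for program in product(alphabet, repeat=length):
                result = run(program, budget)
                if result is not None and is_solution(result):
                    return program, phase
    return None, max_phase


def toy_run(program, budget):
    """Stand-in interpreter: the program encodes n in binary and needs n steps."""
    n = int("".join(str(b) for b in program), 2)
    return n if n <= budget else None


found_program, found_phase = levin_search(toy_run, lambda out: out == 5, max_phase=10)
```

Here the shortest encoding of 5 has length 3 and needs 5 steps, so it first fits into its budget in phase 6, where 2^(6-3) = 8 steps are available.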

Almost all work in traditional RL, however, focuses on incremental settings 
where continual policy improvement can bring more and more reward per time, 
and where experience with suboptimal solution candidates helps to find better
ones. Adaptive Levin Search (ALS) (Schmidhuber et al., 1997b; Wiering &
Schmidhuber, 1996) extends LS in this way: whenever a new candidate is more 
successful than the best previous one, the underlying probability distribution is 
modified to make the new candidate more likely, and a new LS cycle begins. This 
guarantees that the search time spent on each incremental step (given a parti- 
cular probability distribution embodying the current bias) is optimal in Levin’s 
sense. 

It does not guarantee, though, that the total search time is spent optimally, 
because the probability adjustments themselves may have been suboptimal. ALS 
cannot exploit arbitrary regularities in solution space, because the probability 
modification algorithm itself is fixed. A machine learning researcher’s dream 
would be an incremental RL algorithm that spends overall search time optimally 
in a way comparable to non-incremental LS’s. 

Incremental & Deterministic. Which theoretical results exist for incre- 
mental DS in general program spaces? Few, except for the following basic, almost 
trivial one. Suppose the environment allows for separating the search phase into 
repeatable, deterministic trials such that each trial with a given policy yields the 
same reward. Now consider stochastic hill-climbing (SHC), one of the simplest 
DS methods: 

1. Initialize vector-valued policy p, set variables BestPolicy := p, BestResult := -∞.
2. Measure reward R obtained during a trial with actions executed according to p.
3. If BestResult > R then p := BestPolicy, else BestPolicy := p and BestResult := R.
4. Modify p by random mutation. Go to 2.
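
The four steps above translate almost line by line into Python. In this sketch the function names, the fixed number of iterations (the procedure as stated loops forever), and the toy reward function are assumptions made to keep the example runnable.

```python
import random


def stochastic_hill_climbing(evaluate, initial_policy, mutate, n_steps, rng):
    """SHC as in steps 1-4, stopped after n_steps trials."""
    p = list(initial_policy)                             # step 1
    best_policy, best_result = list(p), float("-inf")
    history = []
    for _ in range(n_steps):
        r = evaluate(p)                                  # step 2: one trial
        if best_result > r:                              # step 3
            p = list(best_policy)
        else:
            best_policy, best_result = list(p), r
        history.append(best_result)
        p = mutate(p, rng)                               # step 4
    return best_policy, best_result, history


# Toy deterministic "environment": reward is highest at the origin.
def toy_reward(policy):
    return -sum(x * x for x in policy)


def gaussian_mutation(policy, rng):
    mutated = list(policy)
    index = rng.randrange(len(mutated))
    mutated[index] += rng.gauss(0.0, 0.5)
    return mutated


start = [3.0, -2.0]
best_policy, best_result, history = stochastic_hill_climbing(
    toy_reward, start, gaussian_mutation, 200, random.Random(0))
```

Because `toy_reward` is deterministic and has no state to reset, the recorded best result can never decrease, which is the guarantee discussed next.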

If (1) both environment and p are deterministic, and if (2) the environment 
and all other variables modified by p (such as internal memory cells whose con- 
tents may have changed due to actions executed according to p) are reset after 
each trial, then the procedure above will at least guarantee that performance 
cannot get worse over time. If step 4 allows for arbitrary random mutations and
the number of possible policies is finite then we can even guarantee convergence
on an optimal policy with probability 1, given infinite time. Of course, if the 
random mutations of step 4 are replaced by a systematic enumeration of all pos- 
sible mutations, then we also will be able to guarantee finite (though exponential 
in problem size) time^. Similar statements can be made about alternative DS 
methods such as GP or evolutionary programming. 

Of course, in real-world settings no general method can be expected to obtain 
optimal policies within reasonable time. Typical DS practitioners are usually 
content with suboptimal policies though. The next section, however, will address 
even more fundamental problems of DS. 



5 DS: Problems with Unknown Delays and Stochasticity 

Overview. As mentioned above, DS in policy space does not require certain 
assumptions about the environment necessary for traditional RL. For instance, 
while the latter is essentially limited to memory-free, reactive mappings from 
inputs to actions, DS can be used to search among fairly arbitrary, complex, 
event-memorizing programs (using memory to deal with partial observability), 
at least in simulated, deterministic worlds. In realistic settings, however, reliable 
policy evaluations are complicated by (a) unknown delays between action se- 
quences and observable effects, and (b) stochasticity in policy and environment. 
Given a limited life-time, how much time should a direct policy searcher spend 
on policy evaluations to obtain reliable statistics? Despite the fundamental na- 
ture of this question it has not received much attention yet. Here I evaluate an 
efficient approach based on the success-story algorithm (SSA). It provides a radical
answer prepared for the worst case: it never stops evaluating any previous
policy modification except those it undoes for lack of empirical evidence that 
they have contributed to lifelong reward accelerations. I identify SSA’s funda- 
mental advantages over traditional DS on problems involving unknown delays 
and stochasticity. 

The problem. In realistic situations DS exhibits several fundamental drawbacks:
(1) Often there are unknown temporal delays between causes and effects.
In general we cannot be sure that a given trial was long enough to observe all
long-term rewards/punishments caused by actions executed during the trial. We
do not know in advance how long a trial should take. (2) The policy may be
stochastic, i.e., the learner’s actions are selected nondeterministically according
to probability distributions conditioned on the policy. Stochastic policies are widely
used to prevent learners from getting stuck. Results of policy evaluations,
however, will then vary from trial to trial. (3) Environment and reward may
be stochastic, too. And even if the environment is deterministic it may appear 
stochastic from an individual learner’s perspective, due to partial observability. 

Time is a scarce resource. Hence all direct methods face a central question: to
determine whether some policy is really useful in the long run or just appears to
be (e.g., because the current trial was too short to encounter a punishing state,
or because it was a lucky trial), how much time should the learner spend on
its evaluation? In particular, how much time should a single trial with a given
policy take, and how many trials are necessary to obtain statistically significant
results without wasting too much time? Despite the fundamental nature of these
questions not much work has been done to address them.

^ Note that even for the more limited DPRL algorithms until very recently there have
not been any theorems guaranteeing finite convergence time (Kearns & Singh, 1999).

Basic idea. There is a radical answer to such questions: Evaluate a previous 
policy change at any stage of the search process by looking at the entire time in- 
terval that has gone by since the change occurred — at any given time aim to use 
all the available empirical data concerning long-term policy-dependent rewards! 
A change is considered “good” as long as the average reward per time since its 
creation exceeds the corresponding ratios for previous “good” changes. Changes 
that eventually turn out to be “bad” get undone by an efficient backtracking 
scheme called the success-story algorithm (SSA). SSA always takes into account
the latest information available about long-term effects of changes that have
appeared “good” so far (“bad” changes, however, are not considered again).
Effectively SSA adjusts trial lengths retrospectively: at any given time, trial starts
are determined by the occurrences of the remaining “good” changes representing 
a success story. The longer the time interval that went by since some “good” 
change, the more reliable the evaluation of its true long-term benefits. No trial of 
a “good” change ever ends unless it turns out to be “bad” at some point. Thus 
SSA is prepared for the worst case. For instance, no matter what’s the maximal 
time lag between actions and consequences in a given domain, eventually SSA’s 
effective trials will encompass it. 

What’s next. The next section will explain SSA in detail, and show how 
most traditional direct methods can be augmented by SSA. Then we will ex- 
perimentally verify SSA’s advantages over traditional direct methods in case of 
noisy performance evaluations and unknown delays. 

6 Success-Story Algorithm (SSA) 

Here we will briefly review basic principles (Schmidhuber et al., 1997a). An agent 
lives in environment E from time 0 to unknown time T. Life is one-way: even if 
it is decomposable into numerous consecutive “trials”, time will never be reset. 
The agent has a policy POL (a set of modifiable parameters) and possibly an 
internal state S (e.g., for short-term memory that can be modified by actions). 
Both S and POL are variable dynamic data structures influencing probabilities 
of actions to be executed by the agent. Between time 0 and T, the agent repeats 
the following cycle over and over again (A denotes a set of possible actions):

REPEAT:

   select and execute some a ∈ A with conditional probability

   P(a | POL, S, E).^

   Action a will consume time and may change E, S, and POL.

^ Instead of using the expression policy for the conditional probability distribution P 
itself we reserve it for the agent’s modifiable data structure POL.

Learning algorithms (LAs). LA ⊂ A is the set of actions that can modify
POL. LA’s members are called learning algorithms (LAs). Previous work on
metalearning (Schmidhuber et al., 1997a, 1997b) left it to the system itself to
compute execution probabilities of LAs by constructing and executing POL-
modifying algorithms. In this paper we will focus on a simpler case where all
LAs are externally triggered procedures. Formally this means that P(a ∈ LA |
POL, S, E) = 1 if E is in a given “policy modification state” determined by the
user, and P(a ∈ LA | POL, S, E) = 0 otherwise. For instance, in some of the
illustrative experiments to be presented later, SHC’s mutation process will be 
the only LA. Alternatively we may use LAs that are in fact GP-like crossover 
strategies, or a wide variety of other policy-modifying algorithms employed in 
traditional DS methods. 

Checkpoints. The entire lifetime of the agent can be partitioned into time 
intervals separated by special times called checkpoints. In general, checkpoints 
are computed dynamically during the agent’s life by certain actions in A executed 
according to POL itself (Schmidhuber et al., 1997a). In this paper, however, for
simplicity all checkpoints will be set in advance. The agent’s k-th checkpoint
is denoted $c_k$. Checkpoints obey the following rules: (1) $\forall k$: $0 < c_k < T$. (2)
$\forall j < k$: $c_j < c_k$. (3) Except for the first, checkpoints may not occur before at
least one POL-modification has been computed by some LA since the previous
checkpoint.

Sequences of POL-modifications (SPMs). $SPM_k$ denotes the sequence
of POL-modifications computed by LAs in between $c_k$ and $c_{k+1}$. Since LA execution
probabilities may depend on POL, POL may in principle influence the
way it modifies itself, which is interesting for things such as metalearning. In 
this paper’s experiments, however, we will focus on the case where exactly one 
LA (a simple mutation process like the one used in stochastic hill-climbing) is 
invoked in between two subsequent checkpoints. 

Reward and goal. Occasionally E provides real-valued reward. The cumulative
reward obtained by the agent between time 0 and time t > 0 is denoted
R(t), where R(0) = 0. Typically, in large (partially observable) environments,
maximizing cumulative expected reinforcement within a limited life-time would
be too ambitious a goal for any method. Instead designers of direct policy search
methods are content with methods that can be expected to find better
and better policies. But what exactly does “better” mean in our context? Our
agent’s obvious goal at checkpoint t is to generate POL-modifications accelerating
reward intake: it wants to let $\frac{R(T) - R(t)}{T - t}$ exceed the current average speed
of reward intake. But to determine this speed it needs a previous point t' < t to
compute $\frac{R(t) - R(t')}{t - t'}$. How can t' be specified in a general yet reasonable way? Or,
to rephrase the central question of this paper: if life involves unknown temporal
delays between action sequences and their observable effects and/or consists of 
many successive “trials” with nondeterministic, uncertain outcomes, how long 
should a trial last, and how many trials should the agent look back into time to 
evaluate its current performance?

The success-story algorithm (to be introduced next) addresses this question
in a way that is prepared for the worst case: once a trial of a new policy change 
has begun it will never end unless the policy change itself gets undone (by a 
simple backtracking mechanism which ensures that trial starts mark long-term 
reward accelerations). 

Enforcing success stories. Let V denote the agent’s time-varying set of 
past checkpoints that have been followed by long-term reward accelerations. 
Initially V is empty. $v_k$ denotes the k-th element of V in ascending order. The
success-story criterion (SSC) is satisfied at time t if either V is empty (trivial
case) or if

$$\frac{R(t) - R(0)}{t - 0} < \frac{R(t) - R(v_1)}{t - v_1} < \frac{R(t) - R(v_2)}{t - v_2} < \ldots < \frac{R(t) - R(v_{|V|})}{t - v_{|V|}}. \qquad (1)$$

SSC demands that each checkpoint in V marks the beginning of a long-term 
reward acceleration measured up to the current time t. SSC is achieved by the 
success-story algorithm (SSA) which is invoked at every checkpoint: 

1. WHILE SSC is not satisfied: Undo all POL modifications made since the most
   recent checkpoint in V, and remove that checkpoint from V.

2. Add the current checkpoint to V.

“Undoing” a modification means restoring the preceding POL — this requires 
storing past values of POL components on a stack prior to modification. Thus 
each POL modification that survived SSA is part of a bias shift generated after 
a checkpoint marking a lifelong reward speed-up: the remaining checkpoints in 
V and the remaining policy modifications represent a “success story.” 
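
The two SSA steps and the undo stack can be sketched as follows. This is a minimal illustration rather than the implementation of the cited papers: the class and method names are hypothetical, POL is represented as a Python dict, time and cumulative reward are passed in by the caller, and modifications made before the first checkpoint are simply applied without being logged (an assumption). The SSC test here recomputes every ratio of the success-story inequality directly.

```python
class SuccessStoryAgent:
    """Toy SSA bookkeeping for a policy stored as a dict (hypothetical layout)."""

    def __init__(self, policy):
        self.policy = dict(policy)
        # Each entry: (checkpoint time v, R(v), undo log of modifications made after v).
        self.stack = []

    def modify(self, key, value):
        """Apply one POL modification, saving the old value so it can be undone."""
        if self.stack:
            self.stack[-1][2].append((key, self.policy[key]))
        self.policy[key] = value

    def _ssc_holds(self, t, reward):
        ratios = [reward / t] + [(reward - r_v) / (t - v) for v, r_v, _ in self.stack]
        return all(a < b for a, b in zip(ratios, ratios[1:]))

    def checkpoint(self, t, reward):
        """Invoke SSA at time t, where reward = R(t) is the cumulative reward so far."""
        if t <= 0 or (self.stack and t <= self.stack[-1][0]):
            raise ValueError("checkpoint times must be positive and increasing")
        while not self._ssc_holds(t, reward):            # step 1
            _, _, undo_log = self.stack.pop()
            for key, old_value in reversed(undo_log):
                self.policy[key] = old_value
        self.stack.append((t, reward, []))               # step 2


agent = SuccessStoryAgent({"w": 0})
agent.checkpoint(10, 10.0)
agent.modify("w", 1)
agent.checkpoint(20, 30.0)   # reward intake sped up after time 10: change survives
agent.modify("w", 2)
agent.checkpoint(30, 35.0)   # intake slowed down after time 20: change is undone
policy_after_third = dict(agent.policy)
surviving_checkpoints = [v for v, _, _ in agent.stack]
```

In the example the modification made after time 20 is undone at time 30, because the reward per time since 20 (0.5) falls below the reward per time since 10 (1.25), while the earlier modification survives.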

Trials and “effective” trials. All checkpoints in V represent starts of yet 
unfinished “effective” trials (as opposed to externally defined trials with prewired 
starts and ends). No effective trial ever ends unless SSA restores the policy to 
its state before the trial started. The older some surviving SPM, the more time 
will have passed to collect statistics concerning its long-term consequences, and 
the more stable it will be if it is indeed useful and not just there by chance. 

Since trials of still valid policy modifications never end, they embody a princi- 
pled way of dealing with unknown reward delays. No matter what’s the maximal 
causal delay in a given environment, eventually the system’s effective trials will 
encompass it. 

Metalearning? An interesting example of long delays between actions and 
effects is provided by metalearning (learning a learning algorithm or credit assignment
algorithm) (Lenat, 1983; Schmidhuber, 1987; Schmidhuber et al., 1997a).
For instance, suppose some “metalearning action sequence” changes a learner’s 
credit assignment strategy. To evaluate whether the change is good or bad, ap- 
parently we need something like a “meta-trial” subsuming several lower-level 
“normal” trials in which instances of additional policy changes produced by the 
modified learning strategy somehow get evaluated, to collect evidence concerning
the quality of the modified learning strategy itself. Due to their very nature
such meta-trials will typically consume a lot of time. SSA, however, addresses
this issue in a natural and straightforward way. There is no explicit difference
between trials and meta-trials. But new effective trials do get opened within 
previously started, yet unfinished effective trials. What does this mean? It me- 
ans that earlier SPMs automatically get evaluated as to whether they set the 
stage for later “good” SPMs. For instance, SSA will eventually discard an early 
SPM that changed the policy in a way that increased the probability of certain 
later SPMs causing a waste of time on evaluations of useless additional policy 
changes. That is, SSA automatically measures (in terms of reward/time ratios 
affected by learning and testing processes) the impact of early learning on later 
learning: SSA prefers SPMs making “good” future SPMs more likely. Given ac- 
tion sets that allow for composition of general credit assignment strategies from 
simple LAs, SSA will prefer probabilistic learning algorithms leading to better 
probabilistic learning algorithms. And it will end meta-trials as soon as they 
violate the constraints imposed by the success-story criterion, just like it does 
with “normal” trials. 

Implementing SSA. SSA guarantees that SSC will be satisfied after each 
checkpoint, even in partially observable, stochastic environments with unknown 
delays. (Of course, later SSA invocations using additional, previously unavai- 
lable, delayed information may undo policy changes that survived the current 
checkpoint.) Although inequality (1) contains |V| fractions, SSA can be
implemented efficiently: only the two most recent still valid sequences of modifications
(those “on top of the stack”) need to be considered at a given time in an SSA 
invocation. A single SSA invocation, however, may invalidate many SPMs if 
necessary. 

7 Illustration: How SSA Augments DS 

Methods for direct search in policy space, such as stochastic hill-climbing (SHC) 
and genetic algorithms (GAs), test policy candidates during time-limited trials, 
then build new policy candidates from some of the policies with highest evaluati- 
ons observed so far. As mentioned above, the advantage of this general approach 
over traditional RL algorithms is that few restrictions need to be imposed on 
the nature of the agent’s interaction with the environment. In particular, if the 
policy allows for actions that manipulate the content of some sort of short-term 
memory then the environment does not need to be fully observable. In
principle, direct methods such as Genetic Programming (Cramer, 1985; Banzhaf et al.,
1998), adaptive Levin Search (Schmidhuber et al., 1997b), or Probabilistic
Incremental Program Evolution (Salustowicz & Schmidhuber, 1997) can be used
for searching spaces of complex, event-memorizing programs or algorithms as 
opposed to simple, memory-free, reactive mappings from inputs to actions. 

As pointed out above, a disadvantage of traditional direct methods is that 
they lack a principled way of dealing with unknown delays and stochastic policy 
evaluations. In contrast to typical trials executed by direct methods, however, an 
SSA trial of any previous policy modification never ends unless its reward/time
ratio drops below that of the most recent previously started (still unfinished)
effective trial. Here we will go beyond previous work (Schmidhuber et al., 1997a,
1997b) by clearly demonstrating how a direct method can benefit from
augmentation by SSA in the presence of unknown temporal delays and stochastic
performance evaluations.

Task (Schmidhuber & Zhao, 1999). We tried to come up with the simplest 
task sufficient to illustrate the drawbacks of standard DS algorithms and the way 
SSA overcomes them. Hence, instead of studying tasks that require learning
complex programs that set and reset memory contents (as mentioned above, such
complex tasks provide a main motivation for using DS), we use a comparatively 
simple two-armed bandit problem. 

There are two arms A and B. Pulling arm A will yield reward 1000 with
probability 0.01 and reward -1 with probability 0.99. Pulling arm B will yield reward
1 with probability 0.99 and reward -1000 with probability 0.01. All rewards are
delayed by 5 pulls. There is an agent that knows neither the reward distributions
nor the delays. Since this section’s goal is to study policy search under
uncertainty, we equip the agent with the simplest possible stochastic policy, consisting
of a single variable p (0 ≤ p ≤ 1): at a given time arm A is chosen with
probability p, otherwise B is chosen. Modifying the policy in a very limited, SHC-like
way (see below) and observing the long-term effects is the only way the agent
may collect information that might be useful for improving the policy. Its goal is
to maximize the total reward obtained during its lifetime, which is limited to
30,000 pulls. The maximal expected cumulative reward is 270300 (always choose arm A),
the minimum is -270300 (always choose arm B). Random arm selection yields
expected reward 0. ^
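
The expected values quoted above can be rechecked with exact arithmetic. The following is only an illustrative sketch: the names ARM_A, ARM_B, and expected_lifetime_reward are hypothetical, and the calculation ignores the fact that rewards for the last 5 pulls arrive after the lifetime ends.

```python
from fractions import Fraction

# (reward, probability) pairs taken from the task description.
ARM_A = [(1000, Fraction(1, 100)), (-1, Fraction(99, 100))]
ARM_B = [(1, Fraction(99, 100)), (-1000, Fraction(1, 100))]
LIFETIME = 30_000  # total number of pulls


def expected_reward(arm):
    """Expected reward of a single pull of the given arm."""
    return sum(reward * prob for reward, prob in arm)


def expected_lifetime_reward(p):
    """Expected cumulative reward if arm A is chosen with probability p."""
    per_pull = p * expected_reward(ARM_A) + (1 - p) * expected_reward(ARM_B)
    return LIFETIME * per_pull


print(expected_lifetime_reward(Fraction(1)))     # always arm A
print(expected_lifetime_reward(Fraction(0)))     # always arm B
print(expected_lifetime_reward(Fraction(1, 2)))  # random arm selection
```

Each pull of arm A is worth 9.01 in expectation and each pull of arm B is worth -9.01, which is where the symmetric bounds come from.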

Obviously the task is non-trivial, because the long-term effects of a small 
change in p will be hard to detect, and will require significant statistical sampling. 

It is beside the point of this paper that our prior knowledge of the problem
suggests a more informed alternative approach such as “pull arm A for N trials, 
then arm B for N trials, then commit to the best.” Even cleverer optimizers 
would try various assumptions about the delay lengths and pull arms in turn until 
one was statistically significantly better than the other, given a particular delay 
assumption. We do not allow this, however: we make the task hard by requiring 
the agent to learn solely from observations of outcomes of limited, SHC-like 
policy mutations (details below). After all, in partially observable environments 
that are much more complex and realistic (but less analyzable) than ours this 
often is the only reasonable thing to do. 



^ The problem resembles another two-armed bandit problem for which there is an
optimal method due to Gittins (Gittins, 1989). Our unknown reward delays, however,
prevent this method from being applicable: it cannot discover that the current
reward does not depend on the current input but on an event in the past. In addition,
Gittins’ method needs to discount future rewards relative to immediate rewards.
Anyway, this footnote is beside the point of this paper, whose focus is on direct
policy search; Gittins’ method is not a direct one.




Stochastic Hill-Climbing (SHC). SHC may be the simplest incremental 
algorithm using direct search in policy space. It should be mentioned, however, 
that despite its simplicity SHC often outperforms more complex direct methods 
such as GAs (Juels & Wattenberg, 1996). Anyway, SHC and more complex
population-based direct algorithms such as GAs, GP, and evolution strategies
are equally affected by the central questions of this paper: how many trials should 
be spent on the evaluation of a given policy? How long should a trial take?^ 

We implement SHC as follows: 1. Initialize policy p to 0.5, and real-valued
variables BestPolicy and BestResult to p and 0, respectively. 2. If there have
been more than 30000 - TrialLength pulls then exit (TrialLength is an integer
constant). Otherwise evaluate p by measuring the average reward R obtained
during the next TrialLength consecutive pulls. 3. If BestResult > R then
p := BestPolicy, else BestPolicy := p and BestResult := R. 4. Randomly
perturb p by adding either -0.1 or +0.1 except when this would lead outside the
interval [0,1]. Go to 2.
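
The four steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used for the experiments: the helper make_bandit, the function name shc, and the seed parameter are hypothetical; the first 5 pulls are assumed to deliver reward 0 while earlier rewards are still in transit; a perturbation that would leave [0,1] is assumed to be skipped; and p is rounded to one decimal to keep it on the 0.1 grid despite floating-point error.

```python
import random


def make_bandit(rng, delay=5):
    """Return pull(arm); each call delivers the reward earned `delay` pulls earlier."""
    in_transit = [0.0] * delay  # assumption: reward 0 until the first delayed reward arrives

    def pull(arm):
        if arm == "A":
            reward = 1000.0 if rng.random() < 0.01 else -1.0
        else:
            reward = 1.0 if rng.random() < 0.99 else -1000.0
        in_transit.append(reward)
        return in_transit.pop(0)

    return pull


def shc(trial_length, lifetime=30_000, seed=0):
    """Run SHC once; return (cumulative reward, final p, pulls used)."""
    rng = random.Random(seed)
    pull = make_bandit(rng)
    p = 0.5                                  # step 1
    best_policy, best_result = p, 0.0
    pulls, total = 0, 0.0
    while pulls <= lifetime - trial_length:  # step 2: exit once too few pulls remain
        trial_reward = 0.0
        for _ in range(trial_length):
            arm = "A" if rng.random() < p else "B"
            trial_reward += pull(arm)
        pulls += trial_length
        total += trial_reward
        r = trial_reward / trial_length      # average reward R of this trial
        if best_result > r:                  # step 3
            p = best_policy
        else:
            best_policy, best_result = p, r
        candidate = round(p + rng.choice((-0.1, 0.1)), 1)  # step 4
        if 0.0 <= candidate <= 1.0:          # skip moves that would leave [0, 1]
            p = candidate
    return total, p, pulls


total, p, pulls = shc(trial_length=50, seed=1)
print(total, p, pulls)
```

Because a trial's measured reward partly stems from pulls made before the trial began, the value R computed in step 2 is only loosely tied to the policy being tested, which is the difficulty the text goes on to discuss.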

Problem. Like any direct search algorithm SHC faces the fundamental que- 
stion raised in section 2: how long to evaluate the current policy to obtain sta- 
tistically significant results without wasting too much time? To examine this 
issue we vary TrialLength. Our prior knowledge of the problem tells us that 
TrialLength should exceed 5 to handle the 5-step reward delays. But due to the 
stochasticity of the rewards, much larger TrialLengths are required for reliable 
evaluations of some policy’s “true” performance. Of course, the disadvantage 
of long trials is the resulting small number of possible training exemplars and 
policy changes (learning steps) to be executed during the limited life which lasts 
just 30,000 steps. 

Comparison 1. We compare SHC to a combination of SSA and SHC, which 
we implement just like SHC except that we replace step 3 by a checkpoint (SSA- 
invocation — see section 6). 

Comparison 2. To illustrate potential benefits of policies that influence the
way they learn we also compare to SSA applied to a primitive “self-modifying
policy” (SMP) with two modifiable parameters: p (with same meaning as above)
and “learning rate” δ (initially 0). After each checkpoint, p is replaced by p + δ,
then δ is replaced by 2 * δ. In this sense SMP itself influences the way it changes,
albeit in a way that is much more limited than the one in previous papers
(Schmidhuber et al., 1997a; Schmidhuber, 1999). If δ = 0 then it will be randomly
set to either 0.1 or -0.1. If δ > 0.5 (δ < -0.5) then it will be replaced by 0.5
(-0.5). If p > 1 (p < 0) then it will be set to 1 (0).
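
A single SMP self-modification step might look as follows. This is a sketch only: the function name smp_step is hypothetical, and the order in which the rules are applied (update p, double δ, re-seed a zero δ, then clip both) is an assumption that simply follows the order in which the text lists them.

```python
import random


def smp_step(p, delta, rng=random):
    """Apply one SMP modification; return the new (p, delta)."""
    p = p + delta                        # p is replaced by p + delta
    delta = 2 * delta                    # then delta is doubled
    if delta == 0:
        delta = rng.choice((0.1, -0.1))  # a zero learning rate is re-seeded at random
    delta = max(-0.5, min(0.5, delta))   # clip delta to [-0.5, 0.5]
    p = max(0.0, min(1.0, p))            # clip p to [0, 1]
    return p, delta


rng = random.Random(0)
p, delta = 0.5, 0.0
for _ in range(4):
    p, delta = smp_step(p, delta, rng)
    print(p, delta)
```

The doubling means that a run of surviving modifications moves p toward 0 or 1 in a few steps, while the SSA checkpoint is what would undo the changes if they fail to pay off.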

One apparent danger with this approach is that accelerating the learning 
rate may result in unstable behavior. We will see, however, that SSA precisely 
prevents this from happening by eventually undoing those learning rate modifi- 
cations that are not followed by reliable long-term performance improvement. 



^ In fact, population-based approaches will suffer even more than simple SHC from
unknown delays and stochasticity, simply because they need to test many policies,
not just one.

Fig. 1. Total cumulative reinforcement (averaged over 100 trials) obtained by SHC
(bottom), SSA/SHC (middle), SSA/SMP (top) for varying trial lengths. The picture
does not change much for even longer trial lengths.



Results. For all three methods Figure 1 plots lifelong cumulative reward 
(mean of 100 independent runs) against TrialLength varying from 10 to 600 
pulls with a step size of 10 pulls. For most values of TrialLength, SHC fails 
to realize the long-term benefits of choosing arm A. SSA/SHC, however, al- 
ways yields satisfactory results because it does not care whether TrialLength is 
much too short to obtain statistically significant policy evaluations. Instead it
retrospectively readjusts the “effective” trial starts: at any given checkpoint, each
previous checkpoint in V marks the beginning of a new trial lasting up to the current
checkpoint. Each such trial start corresponds to a lifelong reward-acceleration. 
The corresponding policy modifications gain more and more empirical justifica- 
tion as they keep surviving successive SSA calls, thus becoming more and more 
stable. 

Still, SSA/SHC’s performance slowly declines with increasing TrialLength
since this implies fewer possible policy changes and fewer effective trials due to
limited lifetime. SSA/SMP (comparison 2), however, does not suffer much from
this problem since it boldly increases the learning rate as long as this is
empirically observed to accelerate long-term reward intake. As soon as this is no
longer the case, however, SSA prevents further learning rate accelerations,
thus avoiding unstable behavior. This primitive type of learning algorithm
self-modification outperforms SSA/SHC. In fact, some of the surviving effective trials
may be viewed as “metalearning trials”: SSA essentially observes the long-term
effects of certain learning rates whose values are influenced by the policy
itself, and undoes those that tend to cause “bad” additional policy modifications
setting the stage for worse performance in subsequent trials.

Trials Shorter Than Delays. We also tested the particularly interesting 
case TrialLength < 5. Here SHC and other direct methods fail completely
because the policy tested during the current trial has nothing to do with the 
test outcome (due to the delays). The SSA/SHC combination, however, still 
manages to collect cumulative performance of around 150,000. Unlike with SHC 
(and other direct methods) there is no need for a priori knowledge about “good” 
trial lengths, exactly because SSA retrospectively adjusts the effective trial sizes. 

Comparable results were obtained with much longer delays. In particular, see 
a recent article (Schmidhuber, 1999) for experiments with much longer life-times 
and unknown delays of the order of 10® — 10^ time steps. 

A Complex Partially Observable Environment. This section’s focus 
is on clarifying SSA’s advantages over traditional DS in the simplest possible 
setting. It should be mentioned, however, that there have been much more chal- 
lenging SSA applications in partially observable environments, which represent 
a major motivation of direct methods because most traditional RL methods are 
not applicable here. For instance, a previous paper (Schmidhuber et al., 1997a)
describes two agents A and B living in a partially observable 600 x 500 pixel 
environment with obstacles. They learn to solve a complex task that could not 
be solved by various TD(λ) Q-learning variants (Lin, 1993). The task requires
(1) agent A to find and take a key “key A”; (2) agent A to go to a door “door
A” and open it for agent B; (3) agent B to enter through “door A”, find and 
take another key “key B”; (4) agent B to go to another door “door B” to open 
it (to free the way to the goal); (5) one of the agents to reach the goal. Both 
agents share the same design. Each is equipped with limited “active” sight: by 
executing certain instructions, it can sense obstacles, its own key, the correspon- 
ding door, or the goal, within up to 50 pixels in front of it. The agent can also 
move forward, turn around, turn relative to its key or its door or the goal. It 
can use short-term memory to disambiguate inputs — unlike Jaakkola et al.’s 
method (1995), ours is not limited to finding suboptimal stochastic policies for 
POEs with an optimal solution. Each agent can explicitly modify its own policy 
via special actions that can address and modify the probability distributions 
according to which action sequences (or “subprograms”) are selected (this also 
contributes to making the set-up highly non-Markovian). Reward is provided 
only if one of the agents touches the goal. This agent’s reward is 5.0; the other’s 
is 3.0 (for its cooperation — note that asymmetric reward also introduces com- 
petition). Due to the complexity of the task, in the beginning the goal is found 
only every 300,000 actions on average (including actions that are primitive LAs
and modify the policy). No prior information about good initial trial lengths
is given to the system. Through self-modifications and SSA, however, within
130,000 goal hits (10^9 actions) the average trial length decreases by a factor of
60 (mean of 4 simulations). Both agents learn to cooperate to accelerate reward
intake, by retrospectively adjusting their effective trial lengths using SSA. 

While this previous experimental research has already demonstrated SSA’s 
applicability to large-scale partially observable environments, a study of why it 
performs well has been lacking. In particular, unlike the present work, previous 
work (Schmidhuber et al., 1997a) did not clearly identify SSA’s fundamental
advantages over alternative DS methods. 

8 Relation to Market Models 

There is an interesting alternative class of RL systems that also combines aspects 
of DS and traditional RL. It exploits ideas from the field of economics that seem
naturally applicable in the context of multiagent RL, and (unlike Q-learning 
etc.) are not necessarily limited to learning reactive behavior. In what follows 
I will ignore the extensive original financial literature and briefly review solely 
some work on RL economies instead. Then I will relate this work to SSA. 

Classifier Systems and Bucket Brigade. The first RL economy was
Holland’s by now well-known bucket brigade algorithm for classifier systems
(Holland, 1985). Messages in the form of bitstrings of size n can be placed on a
global message list either by the environment or by entities called classifiers. Each
classifier consists of a condition part and an action part defining a message it
might send to the message list. Both parts are strings out of {0, 1, _}^n, where the
_ serves as a ‘don’t care’ if it appears in the condition part. A non-negative real
number is associated with each classifier indicating its ‘strength’. During one cy- 
cle all messages on the message list are compared with the condition parts of all 
classifiers of the system. Each matching classifier computes a ‘bid’ based on its 
strength. The highest bidding classifiers may place their message on the message 
list of the next cycle, but they have to pay with their bid which is distributed 
among the classifiers active during the last time step which set up the triggering 
conditions (this explains the name bucket brigade). Certain messages result in 
an action within the environment (like moving a robot one step). Because some 
of these actions may be regarded as ‘useful’ by an external critic who can give
payoff by increasing the strengths of the currently active classifiers, learning may 
take place. The central idea is that classifiers which are not active when the en- 
vironment gives payoff but which had an important role for setting the stage for 
directly rewarded classifiers can earn credit by participating in ‘bucket brigade 
chains’. The success of some active classifier recursively depends on the success 
of classifiers that are active at the following time ticks. Unsuccessful classifiers 
are replaced by new ones generated with the help of GAs. 

Holland’s original scheme is similar to DPRL algorithms such as Q-learning in 
the sense that the bids of the agents correspond to predictions of future reward. 
On the other hand, the scheme does not necessarily require full environmental
observability. Instead it partly depends on DS-like evolutionary pressure absent
in traditional RL. E.g., bankrupt agents who spent all their money are removed 
from the system. 

Holland’s approach, however, leaves several loopholes that allow agents to 
make money without contributing to the success of the entire system. This led 
to a lot of follow-up research on more stable RL classifier economies (Wilson, 
1994, 1995; Weiss, 1994; Weiss & Sen, 1996) and other related types of RL 
economies (see below). This work has closed some but possibly not all of the
original loopholes. 

Prototypical Self-referential Associating Learning Mechanisms.
Pages 23-51 of earlier work (Schmidhuber, 1987) are devoted to systems called
PSALM1 - PSALM3. As in Holland’s scheme, competing/cooperating RL
agents bid for executing actions. Winners may receive external reward for
achieving goals. Unlike in Holland’s scheme, agents are supposed to learn the credit
assignment process itself (metalearning). For this purpose they can execute ac- 
tions for collectively constructing / connecting / modifying agents, for assigning 
credit (reward) to agents, and for transferring credit from one agent to another. 
To the best of my knowledge, PSALMs are the first machine learning systems 
that enforce the important constraint of total credit conservation (except for 
consumption and external reward): no agent can generate money from nothing. 
This constraint is not enforced in the original bucket brigade economy, where 
new agents enter with freshly created money (this may cause inflation and other 
problems). Reference (Schmidhuber, 1987) also inspired the neural bucket bri- 
gade (NBB), a slightly more recent but less general approach enforcing money 
conservation, where money is “weight substance” of a reinforcement learning
neural net (Schmidhuber, 1989). 

Hayek Machine (Baum & Durdanovic, 1998). PSALM3 does not strictly
enforce individual property rights. For instance, agents may steal money from
other agents and temporarily use it in a way that does not contribute to the
system’s overall progress. Hayek machines are constitutional economies that
apparently do not suffer from such “parasite problems” (although there is no proof
yet). Hayek2 (Baum & Durdanovic, 1998), the most recent
Hayek variant, is somewhat reminiscent of PSALM3 (agents may construct
new agents, and there is money conservation), but avoids several loopholes in the
credit assignment process that may allow some agent to profit from actions that 
are not beneficial for the system as a whole. Property rights of agents are stric- 
tly enforced. Agents can create children agents and invest part of their money 
into them, and profit from their success. Hayek2 learned to solve rather complex 
blocks world problems (Baum & Durdanovic, 1998). The authors admit, howe- 
ver, that possibly not all potential loopholes rewarding undesired behavior have 
been closed by Hayek2. 

COINs / Wonderful Life Utility (Wolpert, Turner, & Frank, 1999).
Perhaps the only current RL economy with a sound theoretical foundation is the
Collective Intelligence (COIN) by Wolpert, Turner & Frank (1999). A COIN
partitions its set of agents into “subworlds.” Each agent of a given subworld shares
the same local utility function. Global utility is optimized by provably making
sure that no agent in some subworld can profit if the system as a whole does not 
profit. COINs were successfully used in an impressive routing application. 

SSA and Market Models. One way of viewing SSA in an economy-inspired 
framework is this: the current “credit” of a policy change equals the reward since 
its creation divided by the time since its creation. A policy change gets undone 
as soon as its credit falls below the credit of the most recent change that has not 
been undone yet. After any given SSA invocation the yet undone changes reflect 
a success-story of long-term credit increases. 
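
The credit rule just described can be sketched as a stack-based checkpoint routine. This is an illustrative sketch under several assumptions, not the author's implementation: the names Entry and ssa_checkpoint are hypothetical; the policy is reduced to a single number; life is assumed to start at time 0 with cumulative reward 0; the oldest surviving change is compared against the lifelong average reward per time step; and every checkpoint is assumed to occur strictly after the changes it evaluates.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    t: float           # time at which the policy change was made
    reward: float      # cumulative reward at that time
    old_policy: float  # policy value to restore if the change is undone


def ssa_checkpoint(stack, t_now, reward_now, policy):
    """Undo the most recent changes until the surviving ones show rising credit."""

    def credit(t0, r0):
        # Reward per time step collected since (t0, r0).
        return (reward_now - r0) / (t_now - t0)

    while stack:
        top = stack[-1]
        if len(stack) > 1:
            baseline = credit(stack[-2].t, stack[-2].reward)
        else:
            baseline = credit(0.0, 0.0)  # lifelong average (assumption)
        if credit(top.t, top.reward) > baseline:
            break                        # top change is still justified
        policy = top.old_policy          # otherwise undo it
        stack.pop()
    return policy


stack = [Entry(t=10, reward=5, old_policy=0.5), Entry(t=20, reward=30, old_policy=0.6)]
policy = ssa_checkpoint(stack, t_now=30, reward_now=35, policy=0.7)
print(policy, len(stack))
```

In this example the newer change has credit 0.5 against 1.5 for the change beneath it, so it is undone and the policy reverts to 0.6; the older change, with credit 1.5 against a lifelong average of about 1.17, survives. Only the top two stack entries are inspected on each pass of the loop.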

To use the parent/child analogy (Baum & Durdanovic, 1998): at a given 
time, any still valid policy change may be viewed as a (grand-)parent of later 
policy changes for which it set the stage. Children that are more profitable than 
their ancestors protect the latter from being undone. In this way the ancestors 
profit from making successful children. Ancestors who increase the probability 
of non-profitable offspring, however, will eventually risk oblivion.

9 Conclusion 

Direct search (DS) in policy space offers several advantages over traditional 
reinforcement learning (RL). For instance, DS does not need a priori informa- 
tion about world states and the topology of their interactions. It does not care 
whether the environment is fully observable. It makes hierarchical credit assign- 
ment conceptually trivial, and also allows for many alternative, non-hierarchical 
types of abstract credit assignment. 

Existing DS methods, however, do suffer from fundamental problems in the
presence of environmental stochasticity and/or unknown temporal delays between
actions and observable effects. In particular, they do not have a principled way 
of deciding when to stop policy evaluations. 

Stochastic policy evaluation by the success-story algorithm (SSA) differs from 
traditional DS. SSA never quits evaluating any previous policy change that has 
not yet been undone for lack of empirical evidence that it has contributed to a 
lifelong reward acceleration. Each invocation of SSA retrospectively establishes 
a success history of surviving self-modifications: only policy changes that have
empirically proven their long-term usefulness so far get another chance to justify 
themselves. This stabilizes the “truly useful” policy changes in the long run. 

Unlike many traditional value function-based RL methods, SSA is not limited 
to fully observable worlds, and does not require discounting of future rewards. 
It shares these advantages with traditional DS algorithms. Unlike stochastic 
hill-climbing and other DS methods such as genetic algorithms, however, SSA 
does not heavily depend on a priori knowledge about reasonable trial lengths 
necessary to collect sufficient statistics for estimating long-term consequences 
and true values of tested policies. 

On the other hand, many DS methods can be augmented by SSA in a
straightforward way: just measure the time used up by all actions, policy modifications,
and policy tests, and occasionally insert checkpoints that invoke SSA. In this
sense SSA’s basic concepts are not algorithm-specific. Instead they reflect a
novel, general way of thinking about how “true” performance should be measured
in RL systems using DS in policy space.

Since SSA automatically collects statistics about long-term effects of earlier 
policy changes on later ones, it is of interest for improving the credit assignment 
method itself (Schmidhuber et al., 1997a).

Although the present paper’s illustrative SSA application is much less com- 
plex than our previous ones, it is the first to provide insight into SSA’s fun- 
damental advantages over traditional DS methods in case of stochastic policy 
evaluations and unknown temporal delays between causes and effects. 

Market models are similar to traditional DPRL in the sense that the bids of 
their agents correspond to predictions of future reward used in DPRL algorithms 
such as Q-learning. On the other hand, they are similar to DS in the sense 
that they incorporate evolutionary pressure and do not obviously require full 
environmental observability and Markovian conditions but can be applied to 
agents whose policies are drawn from general program spaces. For example, 
bankrupt agents who spent all their money are usually replaced by mutations of 
more successful agents. This introduces Darwinian selection absent in traditional 
DPRL. 

SSA shares certain aspects of market models. In particular, at a given time 
any existing policy modification’s reward/time ratio measured by SSA may be 
viewed as an investment into the future. The policy modification will fail to 
survive once it fails to generate enough return (including the reward obtained 
by its own “children”) to exceed the corresponding investment of its “parent”, 
namely, the most recent previous, still existing policy modification. 

Future research will hopefully show when to prefer pure market models over 
SSA and vice versa. For instance, it will be interesting to study whether the 
former can efficiently deal with long, unknown causal delays and highly stochastic 
policies and environments.

1 Introduction



Sequential behaviors (sequential decision processes) are fundamental to cognitive 
agents. The use of reinforcement learning (RL) for acquiring sequential behaviors 
is appropriate, and even necessary, when there is no domain-specific a priori 
knowledge available to agents (Sutton 1995, Barto et al 1995, Kaelbling et al 
1996, Bertsekas and Tsitsiklis 1996, Watkins 1989). Given the complexity and 
differing scales of events in the world, there is a need for hierarchical RL that 
can produce action sequences and subsequences that correspond with domain 
structures. This has been demonstrated time and again, in terms of facilitating 
learning and/or dealing with non-Markovian dependencies, e.g., by Dayan and 
Hinton (1993), Kaelbling (1993), Lin (1993), Wiering and Schmidhuber (1998), 
Tadepalli and Dietterich (1997), Parr and Russell (1997), Dietterich (1997), and 
many others. Different levels of action subsequencing correspond to different 
levels of abstraction. Thus, subsequencing facilitates hierarchical planning as 
studied in traditional AI as well (Sacerdoti 1974, Knoblock, Tenenberg, and
Yang 1994, Sun and Sessions 1998). 

However, we noticed the shortcomings of structurally pre-determined hierar- 
chies in RL (whereby structures are derived from a priori domain knowledge and 
thus fixed). The problems of such hierarchies include cost (because it is costly 
to obtain a priori domain knowledge to form hierarchies), inflexibility (because 
the characteristics of the domain for which fixed hierarchies are built can change 
over time), and lack of generality (because domain-specific hierarchies most li- 
kely vary from domain to domain). Even when limited learning is used to fine 
tune structurally pre-determined hierarchies (Parr and Russell 1997, Dietterich 
1997), some of these problems persist. 

A more general approach would be to automatically develop hierarchies (of 
actions), from scratch, based only on some generic structures (such as a fixed 
number of levels) and to automatically tailor details of structures and their para- 
meters with reinforcement learning. What this process amounts to is automati- 
cally segmenting action sequences (self-segmentation) and creating a hierarchical 
organization of action subsequences (which can be viewed as “macro-actions” or 
“subroutines”). Subsequences resulting from self-segmentation may be reused
throughout a sequence or by different sequences (i.e., shared among different
tasks). ^ In the end we have a hierarchical organization of sequential behaviors.

Although segmentation is evidently important, some existing hierarchical RL 
models either do not involve segmentation (i.e., using pre-determined hierarchies 
or segmentations) or involve only domain-specific segmentation processes. In 
contrast, we will address the issue of learning segmentation without (or with 
little) domain-specific knowledge or procedures (Ring 1991, Schmidhuber 1992, 
Weiss 1995, Thrun and Schwartz 1995). 

A difficult issue that we face in RL is how to deal with (non-Markovian) tem- 
poral dependencies (that is, situations in which state transition probabilities are 
determined not only by the current state and action but also by previous ones, 
due to partial observability; Lin 1993, McCallum 1996, Kaelbling et al 1996). ^ 
Such dependencies can create problems in reinforcement learning. Through the 
use of reinforcement learning at different levels, we may seek out proper confi- 
gurations of non-Markovian temporal dependencies, with the goal of reducing 
dependencies through segmenting at proper places at different levels so as to 
facilitate the learning of the overall task. 

Therefore, we need to deal with the following issues: (1) automatic (self) 
segmentation of sequences, with each segment to be handled differently by a 
different module, to reduce or remove non-Markovian temporal dependencies; 
(2) automatic development of common subsequences (that is, subroutines or 
subtasks) that can be used in various places of a sequence and/or in different 
sequences, to simplify non-Markovian dependencies and to form compact repre- 
sentations; (3) moreover, segmentation without (or with little) a priori domain- 
specific knowledge as embodied in domain-specific structures, domain-specific 
ways of creating such structures, and so on (because relying on a priori
knowledge leads to a lack of generality).

2 Review of RL 

Reinforcement learning (RL) has received a great deal of attention recently. RL
can be viewed as an on-line variation of value-iteration dynamic programming
(DP). DP in general can be defined as follows: there is a 5-tuple $(S, U, T, P, g)$, in
which $S$ is the set of states, $U$ is the set of actions, and $T$ is the probabilistic state
transition function that maps the current state and the current action to a new
state in the next time step. $P$ is the (reactive) stationary policy that determines
the action at the current time step given the current state: $P(s_t) = u_t$. $g$ is the

^ When adaptive formation of hierarchies is carried out in a multiple-task context, the 
learning of different sequences can influence each other, and common subsequences 
may emerge. As a result, different tasks may share some subsequences (subroutines; 
Thrun and Schwartz 1995). 

^ In such situations, optimal actions at one point may be dependent on not only 
the current state but also states and actions occurring some time ago. See Sondik 
(1978), Monahan (1982), and Puterman (1994) for treatments of POMDPs (partially
observable Markovian decision processes).
cost (or reward) function that maps certain states and/or actions to real values.
That is, in DP, there is a discrete-time system, the state transitions of which
depend on actions performed by an agent. In probabilistic terms, a Markovian
process is at work in determining a new state (from state transition) after
an action is performed:

$$\mathrm{prob}(s_{t+1} \mid s_t, u_t, \ldots) = \mathrm{prob}(s_{t+1} \mid s_t, u_t) = p_{s_t, s_{t+1}}(u_t),$$

where $u_t$ is determined based on the policy $P$ (assuming $u_t = P(s_t)$).
In this process, costs (or rewards) can occur for certain states and/or actions.
Normally the costs/rewards accumulate additively, with or without a discount
factor.

The cumulative cost/reward estimate (which we will denote as J) that re- 
sults from following an optimal policy of actions satisfies the Bellman optimality 
equation: 

$$J(s_t) = \max_{u} \sum_{s_{t+1} \in S} p_{s_t, s_{t+1}}(u) \bigl( g(s_t) + \gamma J(s_{t+1}) \bigr)$$

where $g$ denotes cost/reward (which is assumed to be a function of states), $s_t$
is any state, and $s_{t+1}$ is the new state resulting from action $u$. Or, using the
notation $Q(s_t, u_t)$ (where $\max_u Q(s, u) = J(s)$), we have

$$Q(s_t, u_t) = \sum_{s_{t+1} \in S} p_{s_t, s_{t+1}}(u_t) \bigl( g(s_t) + \gamma \max_{u} Q(s_{t+1}, u) \bigr)$$
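As a concrete illustration of the Bellman optimality equation, the following minimal sketch runs value iteration on a tiny two-state problem. The transition table `P`, the reward vector `g`, the discount `gamma`, and the function names are all illustrative assumptions, not taken from the text.

```python
# Hypothetical 2-state, 2-action problem; all numbers are illustrative.
# P[u][s][s2] = probability of moving from state s to state s2 under action u.
P = [
    [[1.0, 0.0], [0.0, 1.0]],   # action 0: stay in the current state
    [[0.0, 1.0], [1.0, 0.0]],   # action 1: switch to the other state
]
g = [0.0, 1.0]     # reward as a function of the current state
gamma = 0.9        # discount factor


def q_from_j(J):
    """Q(s,u) = sum over s2 of p_{s,s2}(u) * (g(s) + gamma * J(s2))."""
    n_states, n_actions = len(g), len(P)
    return [
        [sum(P[u][s][s2] * (g[s] + gamma * J[s2]) for s2 in range(n_states))
         for u in range(n_actions)]
        for s in range(n_states)
    ]


def value_iteration(tol=1e-10, max_iters=10_000):
    """Iterate J(s) = max_u Q(s,u) until the largest change falls below tol."""
    J = [0.0] * len(g)
    for _ in range(max_iters):
        Q = q_from_j(J)
        J_new = [max(row) for row in Q]
        if max(abs(a - b) for a, b in zip(J, J_new)) < tol:
            return J_new, Q
        J = J_new
    return J, q_from_j(J)


J, Q = value_iteration()
```

In this toy problem the rewarding state is worth about 10 (a reward of 1 per step, discounted by 0.9) and the other state about 9, since one step is spent switching.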



Based on the Bellman optimality equation, there are a number of on-line 
(RL) algorithms for learning Q or J functions. One is the Q-learning algorithm 
of Watkins (1989). The updating is done completely on-line, without explicitly 
using probability estimates: 

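$$\begin{aligned}
Q(s_t, u_t) &:= (1 - \alpha) Q(s_t, u_t) + \alpha \bigl( g(s_t) + \gamma \max_{u_{t+1}} Q(s_{t+1}, u_{t+1}) \bigr) \\
&= Q(s_t, u_t) + \alpha \bigl( g(s_t) + \gamma \max_{u_{t+1}} Q(s_{t+1}, u_{t+1}) - Q(s_t, u_t) \bigr)
\end{aligned}$$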

where $u_t$ is determined by an action policy that allows sufficient exploration
(Bertsekas and Tsitsiklis 1996). For example, it can be based on the Boltzmann
distribution:

$$\mathrm{prob}(u_t) = \frac{e^{Q(s_t, u_t)/\tau}}{\sum_{u} e^{Q(s_t, u)/\tau}}$$

where $\tau$ is the temperature that determines the degree of randomness in action
selection (other policies are also possible; Bertsekas and Tsitsiklis 1996). The
updating is done based on actual state transitions instead of transition
probabilities; that is, on-line “simulation” is performed. With enough sampling, the
transition frequency from $s_t$ and $u_t$ to $s_{t+1}$ should approach $p_{s_t, s_{t+1}}(u_t)$, and
thus provides an estimate. Therefore, the result will be the same values as in
the earlier specification of Q values. Such learning allows completely autonomous
learning from scratch, without a priori domain knowledge.
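The on-line update and Boltzmann action selection can be sketched in a few lines. This is a minimal tabular illustration under stated assumptions: the two-state environment (action 0 stays, action 1 switches), the reward vector, the parameter values, and all function names are hypothetical and chosen only to make the example runnable.

```python
import math
import random


def boltzmann_probs(q_values, tau):
    """Action probabilities proportional to exp(Q / tau)."""
    m = max(q_values)  # subtracting the max leaves the ratios unchanged
    weights = [math.exp((q - m) / tau) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def q_update(Q, s, u, reward, s_next, alpha, gamma):
    """One on-line step: Q(s,u) += alpha * (g + gamma * max Q(s',.) - Q(s,u))."""
    target = reward + gamma * max(Q[s_next])
    Q[s][u] += alpha * (target - Q[s][u])


def run(steps=20000, alpha=0.1, gamma=0.9, tau=0.5, seed=0):
    rng = random.Random(seed)
    g = [0.0, 1.0]                 # hypothetical reward per state
    Q = [[0.0, 0.0], [0.0, 0.0]]   # Q[s][u], learned from scratch
    s = 0
    for _ in range(steps):
        probs = boltzmann_probs(Q[s], tau)
        u = rng.choices([0, 1], weights=probs)[0]
        s_next = s if u == 0 else 1 - s   # sampled transition, no model used
        q_update(Q, s, u, g[s], s_next, alpha, gamma)
        s = s_next
    return Q
```

Calling `run()` learns from sampled transitions only; a lower `tau` makes selection greedier, and a higher `tau` makes it closer to uniform.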

3 The SSS Algorithm

Below we will present an approach for automatically developing hierarchical
reinforcement learning systems, the SSS algorithm, which stands for Self
Segmentation of Sequences. Unlike the Markovian domains usually assumed for Q-learning,
we will deal with non-Markovian domains (partially observable Markovian
decision processes), where an agent observes local information that constitutes an
observational state (or an observation; Bertsekas and Tsitsiklis 1996) but not a
true state. The true state may be determined from the history of observational
states. We create a hierarchical structure that involves multiple modules (each
with either little or no memory) for dealing with such domains, without a priori
domain knowledge. (In the following discussion, the term “state” in general
refers to observational states.)

3.1 The General Idea 

The general idea is as follows: There are a number of individual action modules 
(referred to as Q modules below). Each of them selects actions to be performed
at each step. However, for each Q module, there is also a corresponding controller 
CQ, which determines at each step whether the Q module should continue or 
relinquish control. When a Q module currently in control relinquishes its control, 
a higher-level controller, AQ, will decide which Q module should take over next 
from the current point on. So, in any given state, a pair of Q and CQ should 
be in control. When the CQ selects continue, the Q will select an action with 
regard to the current state that will affect the environment and thus generate 
reinforcement from the environment. When the CQ selects end, the control is 
returned to AQ, which will then proceed immediately to select (with regard to 
the current state) another CQ (and its corresponding Q) to take over. This cycle 
then repeats itself (cf. the idea of “options”; Precup et al 1998). 

Each component in this system is learned from scratch (in contrast to e.g. 
Precup et al 1998, Dietterich 1997). Let us look into how each component learns: 

— Each Q module tries to receive as much reinforcement as possible before it 
is forced to give up, i.e., it performs optimization within a local segment (a set
of adjacent states). However, local segments change over time and thus the 
world is not static for a Q module. That is, the learning of self-segmentation 
(determined jointly by CQ and AQ) is concurrent with the local learning of 
Q modules. The non-static situation adds to the complexity of learning. 

— Each CQ tries to determine at the current point whether it is more advanta- 
geous to terminate the corresponding Q module or to let it continue, in terms 
of maximizing the overall reinforcement. ^ A CQ considers whether or not 

® If the Q module is continued, the overall reinforcement is the (discounted) sum of 
all the reinforcements that will be received by the Q module plus the (discounted) 
sum of reinforcement that will be received by subsequent Q’s (since the active Q 
module will be terminated eventually). If the Q module is terminated, the overall 
reinforcement is the (discounted) sum of all the reinforcement that will be received 
by subsequent Q’s.

to give up based on which way will lead to more overall reinforcement. This
is partially determined by the performance of all Q’s executed subsequently 
and by the selection made by AQ. So, the CQ has to deal with the problem 
of dynamic interaction. 

— AQ tries to learn the selection of Q modules at various points (by an abstract
control action), by evaluating which selection will lead to more overall
reinforcement. Again this is not a static situation: which states become decision
points for AQ is partially determined by CQ’s (which determine when to
give control back to AQ, which is in turn determined by Q’s, which select
actions that generate reinforcement). As each Q and CQ pair learns, the
selection by AQ may have to change also.

This two-level structure can be extended to more than two levels. 



3.2 The Algorithm 

The Overall Specification Let $s$ denote the observational state of an agent
at a particular moment (not necessarily a complete description). Assume
reinforcements are associated with the current observational state: $g(s)$. We will
focus on Q-learning to illustrate our idea (although other RL algorithms are
also applicable). Let us look into a minimal structure: a two-level hierarchy, in
which there are three types of learning modules:

— Individual action module Q: Each performs actions and learns through
Q-learning.

— Individual controller CQ: Each CQ learns when a Q module (corresponding 
to the CQ) should continue its control and when it should give up the control. 
The learning is accomplished through (separate) Q-learning. 

— Abstract controller AQ: It performs and learns abstract control actions, that 
is, which Q module to select under what circumstances. The learning is 
accomplished through (separate) Q-learning. 

The overall algorithm is as follows: 

- 1. Observe the current state $s$.

- 2. The currently active Q/CQ pair takes control. If there is no active
one (when the system first starts), go to step 5.

- 3. The active CQ selects and performs a control action based on
$CQ(s, ca)$ for different $ca$. If the action chosen by CQ is end, go
to step 5. Otherwise, the active Q selects and performs an action
based on $Q(s, a)$ for different $a$.

- 4. The active Q and CQ perform learning. Go to step 1.

- 5. AQ selects and performs an abstract control action based on $AQ(s, aa)$
for different $aa$, to select a Q/CQ pair to become active.

- 6. AQ performs learning. Go to step 1.
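The control flow of steps 1 to 6 can be sketched as follows. This is only an illustration of who is in control at each step: the learning steps (4 and 6) are omitted, and the decision functions are hand-written stand-ins for the learned CQ, Q, and AQ tables. The function names, the `max_passes` guard, and the toy corridor environment are assumptions made for the example.

```python
def sss_step(s, active, cq_select, q_select, aq_select):
    """One pass through steps 2, 3 and 5 for the observation s.

    active          index of the active Q/CQ pair, or None at start-up
    cq_select(k, s) returns "continue" or "end" for pair k
    q_select(k, s)  returns a primitive action for pair k
    aq_select(s)    returns the index of the pair to activate
    Returns (active, action); action is None when AQ has just handed over control.
    """
    if active is not None and cq_select(active, s) == "continue":
        return active, q_select(active, s)       # step 3: act in the environment
    return aq_select(s), None                    # step 5: AQ picks a Q/CQ pair


def run_episode(env_step, s0, n_steps, cq_select, q_select, aq_select,
                max_passes=10_000):
    """Collect (state, active pair, action) triples for n_steps primitive actions."""
    s, active, trace = s0, None, []
    for _ in range(max_passes):                  # guard against endless hand-overs
        if len(trace) >= n_steps:
            break
        active, action = sss_step(s, active, cq_select, q_select, aq_select)
        if action is None:
            continue                             # back to step 1 in the same state
        trace.append((s, active, action))
        s = env_step(s, action)
    return trace


# Toy corridor: pair 0 is responsible for positions below 3, pair 1 for the rest.
trace = run_episode(
    env_step=lambda s, a: s + 1,
    s0=0,
    n_steps=5,
    cq_select=lambda k, s: "continue" if (s < 3) == (k == 0) else "end",
    q_select=lambda k, s: "right",
    aq_select=lambda s: 0 if s < 3 else 1,
)
```

In this toy run the sequence is segmented at position 3, where the first pair relinquishes control and AQ activates the second one.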

Learning The learning rules, along with explanations of their purposes, are
detailed as follows.

— Individual action modules. For the active $Q_k$, when neither the current action
nor the next action by the corresponding $CQ_k$ is end, ^ we use the usual
Q-learning rule:

$$\Delta Q_k(s, a) = \alpha \bigl( g(s) + \gamma \max_{a'} Q_k(s', a') - Q_k(s, a) \bigr)$$

where $s'$ is the new state resulting from action $a$ in state $s$.

Explanation: When $CQ_k$ decides to continue, $Q_k$ learns to estimate reinforcement
generated (1) by its current action and (2) by its subsequent actions (assuming a
greedy policy: taking the action that has the highest value in a state), (3) plus the
reinforcement generated later by other modules after $Q_k$/$CQ_k$ relinquishes control
(see the next learning rule). The value of $Q_k$ is the sum of the estimates of these
three parts.

When the current action of $CQ_k$ (in $s$) is continue and the next action of
$CQ_k$ (in $s'$) is end, the $Q_k$ module receives as reward the maximum value of
AQ:

$$\Delta Q_k(s, a) = \alpha \bigl( g(s) + \gamma \max_{aa'} AQ(s', aa') - Q_k(s, a) \bigr)$$

where $s'$ is the new state (resulting from action $a$ in state $s$) in which control
is returned to AQ by $CQ_k$, and $aa'$ is any abstract action of AQ.
Explanation: As the corresponding CQ terminates the control of the current Q 
module, AQ has to take an abstract action in state $s'$, the value of which is the
(discounted) total reinforcement that will be received from that point on (by the 
whole system), if that action is taken and if thereafter the greedy policy (taking 
the action that has the highest value in the current state) is followed by AQ 
and the current (stochastic) policies are followed in all the other modules. So, at 
termination, the Q module learns the total (discounted) reinforcement that will 
be received by the whole system thereon (if the greedy policy is followed by AQ 
and the current stochastic policies are followed by the other modules). Combining 
the explanations of the above two rules, we see that a Q module decides its action 
based on which action will lead to higher overall reinforcement from the current 
state on (both by itself and by subsequent modules). ® 

Remarks: If reinforcement is given only at the end when the goal is reached, the 
above learning rules amount to assigning a reward to a module based on the number 
of steps to the goal from the state in which it was terminated. Instead of counting 
the number of steps to the goal, we simply use discounting of reinforcement to 
achieve the same effect (because the total amount of discounting is determined 
by the number of steps to the goal). Thus, when Q is terminated, it estimates in 

Note that there is a one-step delay in the application of the learning rule. 

® Note that we are being optimistic about the selections to be made by AQ. An 
alternative (the SARSA learning rule) is to use $AQ(s', aa)$ in the above formula,
where aa is the actual abstract action taken by AQ. But that change will not produce 
an estimated average of reinforcement unless the learning rules for AQ are changed 
accordingly.

this way how close it has gotten to the goal at that point. In this way, we do not
need to use separate rewards for individual modules (which require a priori domain 
knowledge to set up). The same can be said about CQ below. 

— Individual controllers. For the corresponding $CQ_k$, there are also two separate
learning rules, for the two different actions. When the current action by
$CQ_k$ is continue, the learning rule is the usual Q-learning:

$$\Delta CQ_k(s, \mathit{continue}) = \alpha \bigl( g(s) + \gamma \max_{ca'} CQ_k(s', ca') - CQ_k(s, \mathit{continue}) \bigr)$$

where $s'$ is the new state resulting from action continue in state $s$ and $ca'$
is any control action by $CQ_k$.

Explanation: When $CQ_k$ decides to continue, it learns to estimate reinforcement
generated by the actions of the corresponding $Q_k$, assuming that $CQ_k$ gives up
control whenever end has a higher value, plus the reinforcement generated after
$Q_k$/$CQ_k$ relinquishes control (see the next learning rule). That is, $CQ_k$ learns
the value of continue as the sum of the expected reinforcement generated by $Q_k$
until $CQ_k(\mathit{end}) > CQ_k(\mathit{continue})$ and the expected reinforcement generated by
subsequent modules thereafter.

When the current action by $CQ_k$ is end, the learning rule is

$$\Delta CQ_k(s, \mathit{end}) = \alpha \bigl( \max_{aa} AQ(s, aa) - CQ_k(s, \mathit{end}) \bigr)$$

where $aa$ is any abstract control action by AQ.

Explanation: That is, when a CQ ends, it learns the value of the best abstract 
action to be performed by AQ, which is equal to the (discounted) total reinforce- 
ment that will be accumulated by the whole system from the current state on (if 
the greedy policy is followed by AQ and the current stochastic policies are followed 
by the other modules). Combining the above two learning rules, in effect, the CQ 
module learns to make its decisions by comparing whether giving up or continuing 
control will lead to more overall reinforcement from the current state on. 

— Abstract controllers. For AQ, we have the following learning rule:

$$\Delta AQ(s, aa) = \alpha \bigl( ag(s) + \gamma^m \max_{aa'} AQ(s', aa') - AQ(s, aa) \bigr)$$

where $aa$ is the abstract control action that selects a Q module to take
control, $s'$ is the new state (after $s$) in which AQ is required to make a
decision, $m$ is the number of time steps taken to go from $s$ to $s'$ (used
in determining the amount of discounting $\gamma^m$), and $ag$ is the (discounted)
cumulative reinforcement received after the abstract control action for state
$s$ and before the next abstract control action for state $s'$ (that is,
$ag(s) = \sum_{k=0,1,\ldots,m-1} g(s_k) \gamma^k$, where $s_0 = s$ and $s_m = s'$).

Explanation: AQ learns the value of an abstract control action that it selects, 
which is the (discounted) cumulative reinforcement that the chosen Q module will 
accrue before AQ has to make another abstract action decision, plus the accumula- 
tion of reinforcement from the next abstract action on, assuming the greedy policy 
will be followed by AQ and the current (stochastic) policies will be followed by the
other modules. In other words, AQ estimates the total reinforcement to be accrued
(if it is greedy and all the other modules follow their current policies). So in effect 
AQ learns to select lower-level modules by comparing what selections lead to more 
overall reinforcement from the current point on. 
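The learning rules for Q, CQ, and AQ given above can be written down directly as tabular updates. The sketch below is an illustration under assumptions: each table is a dictionary of dictionaries indexed as `table[state][action]`, and the function and parameter names are hypothetical. The `next_is_end` flag stands for the condition that the next action of the corresponding CQ is end.

```python
def update_q(Qk, s, a, g_s, s_next, alpha, gamma, next_is_end, AQ=None):
    """Q_k rule: bootstrap from Q_k itself, or from max AQ(s', .) if CQ_k ends in s'."""
    table = AQ if next_is_end else Qk
    boot = max(table[s_next].values())
    Qk[s][a] += alpha * (g_s + gamma * boot - Qk[s][a])


def update_cq_continue(CQk, s, g_s, s_next, alpha, gamma):
    """CQ_k rule for continue: ordinary Q-learning over the control actions."""
    boot = max(CQk[s_next].values())
    CQk[s]["continue"] += alpha * (g_s + gamma * boot - CQk[s]["continue"])


def update_cq_end(CQk, AQ, s, alpha):
    """CQ_k rule for end: move toward the best abstract action value in s."""
    CQk[s]["end"] += alpha * (max(AQ[s].values()) - CQk[s]["end"])


def discounted_return(rewards, gamma):
    """ag(s): sum of g(s_k) * gamma**k over the m steps of one segment."""
    return sum(r * gamma ** k for k, r in enumerate(rewards))


def update_aq(AQ, s, aa, rewards, s_next, alpha, gamma):
    """AQ rule: segment return plus gamma**m times the best next abstract value."""
    m = len(rewards)
    ag = discounted_return(rewards, gamma)
    boot = max(AQ[s_next].values())
    AQ[s][aa] += alpha * (ag + gamma ** m * boot - AQ[s][aa])
```

For example, with `gamma = 0.5`, a two-step segment with rewards `[1.0, 1.0]` gives `ag = 1.5`, and the value of the next abstract decision is discounted by `0.5 ** 2`.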

Technically, we only need either CQ’s or AQ, but not both, since their values 
are closely related (if an identical input representation is given to them). Howe- 
ver, as we will discuss later, we have to keep them separate when we have more 
than two levels. Furthermore, if we introduce different input representations for 
different levels, we must keep them separate (even in the case of two levels). Fi- 
nally, if we want these modules to be independent, autonomous entities (agents), 
then the separation of CQs and AQ is justified on the basis of separate learning 
by separate entities. 



Temporal Representation in Modules In SSS, we may need to construct 
temporal representations, in order to avoid excessive segmentation when there 
is too much temporal dependency. One possibility is recurrent neural networks 
(e.g., Elman 1990). RNNs have been used to represent temporal dependencies
of various forms in much previous work, such as Elman (1990), Lin (1993),
Whitehead and Lin (1995), and Giles et al (1995). RNNs have shortcomings,
such as long learning time, possible inaccuracy, possibly improper generalizations,
and so on. Theoretically, an RNN can memorize arbitrarily long sequences,
but practically, due to limitations on precision, the length of a sequence that can
be memorized is quite limited (Lin 1993).

As a more viable alternative to RNNs, we may use decision trees in RL,
along the lines of McCallum (1996), which splits a state if a statistical test shows
that there are significant differences among the cases covered by that state (that
is, if the state can be divided into two states by adding features from preceding
steps so as to provide a more accurate assessment of the (discounted) cumulative
reinforcement that will be received). Compared with a priori specifications of
non-Markovian temporal dependencies (i.e., specifying exactly which preceding step
is relevant ahead of time), this method is more widely applicable, because a
priori knowledge may not be available. Compared with RNNs, this method can
be more robust, because, due to limited precision, temporal representation in
RNNs can fade away quickly over a few steps and thus RNNs may have trouble
dealing with long-range dependencies. Compared with using a full-scale temporal
representation such as the n-fold state space method (in which an n-fold
state space is made up of states and their corresponding $n - 1$ preceding states),
the advantage of this method is that it may (ideally) seek out those and only
those relevant preceding steps that are involved in temporal dependencies and
thus reduce the complexity of representation. The shortcoming of the method is
that, if unconstrained, the search for relevant historic features can be exhaustive
and thus costly.
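For comparison, the n-fold state space mentioned above is easy to construct, which also makes its cost visible: with a fixed set of observations, the number of possible n-fold states grows exponentially with n. The following minimal sketch assumes observations arrive as a plain sequence; the function name and the `pad` value used for missing early history are illustrative choices.

```python
from collections import deque


def n_fold_states(observations, n, pad=None):
    """Yield tuples (o_{t-n+1}, ..., o_t): the current observation with its
    n - 1 predecessors. Missing early history is filled with `pad`."""
    window = deque([pad] * n, maxlen=n)
    for obs in observations:
        window.append(obs)          # the oldest entry drops out automatically
        yield tuple(window)


# With n = 2, each state is the pair (previous observation, current observation).
states = list(n_fold_states("abc", 2))
```

A decision-tree method, by contrast, would add a preceding step to the representation only where doing so changes the estimated reinforcement.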







3.3 Extensions to More Levels 

The above specification is with respect to a two-level hierarchy, the components
of which can be denoted as $AQ_0$ (= Q), $CQ_0$ (= CQ), and $AQ_1$ (= AQ). We call
it the zeroth-order system. We can easily extend it to more levels. That is, on
top of $AQ_0$'s, $CQ_0$'s, and $AQ_1$'s, we have $AQ_2$ for selecting among $AQ_1$'s as
well as $CQ_1$ that determines when the corresponding $AQ_1$ should give up. We
obtain a first-order system this way. Then, on top of $AQ_2$'s, we can have $AQ_3$ for
selecting among $AQ_2$'s and $CQ_2$'s for determining termination. That is, we use
a number of first-order systems to develop a second-order system. This process
can be repeated any number of times.

At any step, there should always be an active $AQ_i$/$CQ_i$ pair at each level $i$
(where $i = 0, 1, \ldots, M$). We use $k$ (or $j$) to denote the pair of controllers that is
active. Here is the algorithm:

/* using a bottom-level module */

- 1. Observe the current state $s$.

- 2. The currently active bottom-level pair, $AQ_0^k$/$CQ_0^k$, takes control. If
there is no active one, set $i$ to $M$, activate the controller $AQ_M$/$CQ_M$,
and go to step 5.

- 3. The active $CQ_0^k$ selects an action based on $CQ_0^k(s, ca)$ for different
$ca$. If the action chosen by $CQ_0^k$ is end, set $i$ to 1 and go to step
5. Otherwise, the corresponding $AQ_0^k$ selects and performs an action
based on $AQ_0^k(s, a)$ for different $a$.

- 4. The active $AQ_0^k$ and $CQ_0^k$ perform learning in their respective ways.
Go to step 1.

/* moving up or down the hierarchy */

- 5. The currently active controller at level $i$, $CQ_i^k$/$AQ_i^k$, takes control.
$CQ_i^k$ selects an action based on $CQ_i^k(s, ca)$ for different $ca$. ($CQ_M$
always selects continue.)

- 6. If the $CQ_i^k$ decision is to continue, a $CQ_{i-1}^j$/$AQ_{i-1}^j$ pair is selected
by $AQ_i^k$ and activated ($i > 0$).

- 6.1. $CQ_i^k$ and $AQ_i^k$ perform learning respectively.

- 6.2. If $i - 1 > 0$, set $i$ to $i - 1$ and go to step 5 (moving down the
hierarchy). Otherwise, go to step 3 (reaching the bottom of the hierarchy).

- 7. If the $CQ_i^k$ decision is to end, find the currently active controller at
the higher level $i + 1$, $CQ_{i+1}^j$/$AQ_{i+1}^j$ ($i < M$).

- 7.1. $CQ_i^k$ and $AQ_i^k$ perform learning respectively.

- 7.2. Set $i$ to $i + 1$ and go to step 5 (moving up the hierarchy) ($i < M$).
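The movement up and down the hierarchy in steps 2 to 7 can be sketched as a single function that keeps moving between levels until some bottom-level module issues a primitive action. This is a control-flow illustration only: the learning steps (4, 6.1, and 7.1) are left out, and the names `choose_action`, `cq_decide`, `aq_select`, the `max_moves` guard, and the convention of a single top-level pair with index 0 are assumptions made for the example. It requires at least two levels ($M \geq 1$).

```python
def choose_action(s, active, M, cq_decide, aq_select, max_moves=1000):
    """Return a primitive action for observation s, updating `active` in place.

    active[i]          index of the active pair at level i (None if not yet set)
    cq_decide(i, k, s) "continue" or "end" for pair k at level i
    aq_select(i, k, s) a primitive action if i == 0, else the index of a
                       level i-1 pair to activate
    """
    if active[0] is None:                 # step 2: nothing active, start at the top
        active[M] = 0                     # assumption: one top-level pair, index 0
        i = M
    elif cq_decide(0, active[0], s) == "continue":    # step 3
        return aq_select(0, active[0], s)
    else:
        i = 1                             # the bottom pair ended: go to level 1
    for _ in range(max_moves):
        # Step 5: the top-level CQ always selects continue.
        decision = "continue" if i == M else cq_decide(i, active[i], s)
        if decision == "end":             # step 7: move up the hierarchy
            i += 1
            continue
        active[i - 1] = aq_select(i, active[i], s)    # step 6: pick a lower pair
        if i - 1 > 0:                     # step 6.2: keep moving down
            i -= 1
            continue
        if cq_decide(0, active[0], s) == "continue":  # back at step 3
            return aq_select(0, active[0], s)
        i = 1                             # the new bottom pair ended at once
    raise RuntimeError("no primitive action selected within max_moves")
```

The caller would perform the returned action, observe the next state, and call the function again with the same `active` list.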

The new learning rules are as follows (where p is a penalty term, which is set 
to 0 in our experiments). 

— For the active $AQ_0^k$, when neither the current action nor the next action (at
the next time step) by $CQ_0^k$ is end:

$$\Delta AQ_0^k(s, a) = \alpha \bigl( g(s) + \gamma \max_{a'} AQ_0^k(s', a') - AQ_0^k(s, a) \bigr)$$

where $s'$ is the new state resulting from an action $a$ in state $s$. When the
current action by $CQ_0^k$ is continue but the next action by $CQ_0^k$ is end,

$$\Delta AQ_0^k(s, a) = \alpha \bigl( g(s) + \gamma \max_{aa'} AQ_1^j(s', aa') - p - AQ_0^k(s, a) \bigr)$$

where $p$ is a penalty for moving up the hierarchy and $s'$ is the new state
resulting from action $a$ in state $s$. $AQ_1^j$ is the active module at the higher
level at the next time step ($s'$).

— For the active $CQ_0^k$, there are two separate learning rules for two different
actions:

$$\Delta CQ_0^k(s, \mathit{continue}) = \alpha \bigl( g(s) + \gamma \max_{ca'} CQ_0^k(s', ca') - CQ_0^k(s, \mathit{continue}) \bigr)$$

$$\Delta CQ_0^k(s, \mathit{end}) = \alpha \bigl( \max_{aa} AQ_1^j(s, aa) - p - CQ_0^k(s, \mathit{end}) \bigr)$$

where $s'$ is the new state resulting from an action by $AQ_0^k$ in state $s$ and $p$
is a penalty for moving up the hierarchy. $AQ_1^j$ is the active module at the
higher level at the current time step.

— For the active $AQ_i^k$, where $i = 1, 2, \ldots, M$, we have the following learning
rule, when both the current action and the next action by the corresponding
$CQ_i^k$ are continue:

$$\Delta AQ_i^k(s, aa) = \alpha \bigl( ag(s) + \gamma^m \max_{aa'} AQ_i^k(s', aa') - AQ_i^k(s, aa) \bigr)$$

where $aa$ is the abstract control action that selects a lower-level module, $s'$ is
the new state encountered (after $aa$ is completed), $m$ is the number of time
steps taken to go from $s$ to $s'$ (used for determining the amount of discounting
$\gamma^m$), and $ag(s)$ is the (discounted) cumulative reinforcement received after
the current abstract control action for state $s$ and before the next abstract
control action for state $s'$ (that is, $ag(s) = \sum_{k=0,1,\ldots,m-1} g(s_k) \gamma^k$, where
$s_0 = s$ and $s_m = s'$). When the current action by the corresponding $CQ_i^k$ is
continue but the next action by the $CQ_i^k$ is end,

$$\Delta AQ_i^k(s, aa) = \alpha \bigl( ag(s) + \gamma^m \max_{aa'} AQ_{i+1}^j(s', aa') - p - AQ_i^k(s, aa) \bigr)$$

where $aa$ is the abstract control action that selects a lower-level module, $s'$
is the new state encountered (after $aa$ is completed). $AQ_{i+1}^j$ is the active
module at the higher level when $s'$ is encountered.

— For the active $CQ_i^k$, where $i = 1, 2, \ldots, M$, we have the following learning
rule for end:

$$\Delta CQ_i^k(s, \mathit{end}) = \alpha \bigl( \max_{aa} AQ_{i+1}^j(s, aa) - p - CQ_i^k(s, \mathit{end}) \bigr)$$

where $p$ is a penalty for moving up the hierarchy and $AQ_{i+1}^j$ is the active
module at the higher level at the current time step. We have the following
learning rule for continue:

$$\Delta CQ_i^k(s, \mathit{continue}) = \alpha \bigl( ag(s) + \gamma^m \max_{ca'} CQ_i^k(s', ca') - CQ_i^k(s, \mathit{continue}) \bigr)$$

where $s'$ is the next state in which $CQ_i^k$ is to make a decision, and $m$ is the
number of time steps taken to go from $s$ to $s'$.

Note that, at different levels, we can set different learning rates and temperatures
(for $CQ_i$'s and $AQ_i$'s). Note also that, although a two-level hierarchy can
be easily simplified (if the same representation is used throughout) by examining 
the values of all the CQ’s without using AQ at all in the selection of lower-level 
modules, a hierarchy of more than two levels cannot be easily collapsed into one 
level in this way. Hierarchical decision making has its advantages. 

4 Analysis 

4.1 Analysis of Non-Markovian Dependencies

Let us examine several special cases of non-Markovian dependencies that are 
particularly suitable for SSS. For simplicity, we will assume that a two-level
hierarchy is used. We use “non-Markovian tasks” to refer to tasks in which non- 
Markovian dependencies (as defined in section 1) exist, and “Markovian tasks” 
to refer to tasks in which no such dependencies exist. 

1. Using Markovian subsequencing to turn non-Markovian tasks into Markovian
ones. This is the simplest case of dealing with non-Markovian dependencies, 
which was addressed by a similar but more limited model proposed by Wier- 
ing and Schmidhuber (1998). An example used in Wiering and Schmidhuber 
(1998) is as follows: An agent is supposed to turn left at the first intersection, 
and then turn right at the second. Thus it is a non-Markovian sequence, as 
the second turn is dependent on the first turn (assuming that states are composed
of visual indicators for intersections and other features). However, if we
make a switch to a different (sub)sequence after the first turn, we then have 
a chain of two Markovian subsequences, instead of one non-Markovian se- 
quence. Through the use of subsequencing, a non-Markovian sequence may 
become multiple Markovian sequences in a chain, as higher-level chaining 
provides the necessary context for local Markovian decision making. The ac- 
tion modules (for handling respective subsequences) have different policies, 
in accordance with the global context, i.e., their respective positions in a 
chain. ® 

Segmentation that turns a non-Markovian sequence into multiple Marko- 
vian sequences in a chain can be accomplished by SSS through (implicitly) 
comparing overall reinforcement received through segmentation at different 
points. Both the higher-level and lower-level modules use atemporal repre- 
sentation in this case (i.e., without memory of any sort). Because a better 
way of segmentation will lead to a higher amount of overall reinforcement, 
eventually better ways of segmentation will likely dominate. ^ This is an 

® Clearly, not all non-Markovian sequences with long-range dependencies can be hand- 
led this way. The scenario represents a small portion of non-Markovian situations. 

^ We cannot guarantee that though, because it is possible to be trapped in a “local 
minimum”. The same caveat applies below.

extension of the approach of Wiering and Schmidhuber (1998). In Wiering
and Schmidhuber’s (1998) model, an a priori determined chain of modules 
is used, and each module has to select a subgoal and keep running until the 
goal is achieved (and cannot be flexibly switched off). 

2. Using non-Markovian subsequencing to turn non-Markovian tasks into Markovian
ones. When non-Markovian dependencies are dense but limited within
small temporal segments (subsequences), if we can isolate these small segments and
deal with them separately, the overall task can be treated as Markovian.
The difficulty lies in how we determine these local segments. Using a
two-level structure in which the lower level uses temporal representation and 
the higher level uses atemporal representation, SSS can determine relevant 
subsequences through (implicitly) comparing overall reinforcement received 
through segmentation at different points. Segmentation at the proper points 
that separate local segments will result in the largest overall reinforcement 
(because such segmentation leads to capturing all the dependencies), and 
therefore those points will likely be adopted. The algorithm typically ad- 
justs the lengths of subsequences in different modules to accommodate the 
different lengths of the non-Markovian dependencies in different local seg- 
ments. ® 

We refer to such a process as “automatic temporal localization of
non-Markovianness”. We refer to sequences in which non-Markovian dependencies
can be localized as having the “temporal locality property” (as opposed
to the “spatial locality property” assumed in Dayan and Hinton 1993 for 
deriving their spatially hierarchical model). Although local non-Markovian 
dependencies can be handled by further segmentation without the use of 
temporal representation, it is unwieldy to do so, because too many segments 
may be generated and thus too many modules may be needed. 

3. Using Markovian subsequencing and non-Markovian meta-sequencing for 
handling long-range (but sporadic) non-Markovian dependencies. Using a 
two-level structure in which the lower level uses atemporal representation 
and the higher level uses temporal representation, SSS can seek out and 
handle long-range dependencies. Under the assumption that long-range de- 
pendencies are sporadic, the higher-level process will determine those points 
in a sequence that are involved in a non-Markovian dependency, while the 
lower-level processes will handle Markovian processes in between. Ideally, 
the segmentation of subsequences occurs at those points that are involved 
in long-range dependencies. The SSS algorithm is capable of such segmen- 
tation, because it can (implicitly) compare overall reinforcement received 
through segmentation at different points and tends to find the best way(s)
of segmentation that results in the largest reinforcement. Given the higher- 
level temporal representation and the lower-level atemporal representation, 
segmentation at those points that are involved in long-range dependencies 

® In addition, local non-Markovian processes can be different from each other, and 
thus the very segmentation into these local subsequences captures certain global 
non-Markovian dependencies (as in the first case).

will result in the largest overall reinforcement (because such segmentation
leads to capturing all the dependencies without involving other states
unnecessarily), and therefore those points will likely be adopted.

4. Separating the handling of coexisting short-range and long-range non-Mar- 
kovian dependencies. Our approach can also be suitable for dealing with a 
mixture of long-range and short-range (global and local) non-Markovian
dependencies. In the presence of both types of dependencies, SSS typically finds
a way of separating different (global or local) dependencies and treats them
separately. Using temporal representation at both the higher and lower level,
the separation of the two types of dependencies is accomplished by compa- 
ring overall reinforcement received through segmentation at different points. 
The better ways of segmentation that treat global and local non-Markovian 
dependencies separately and correctly should result in more overall reinforce- 
ment eventually. This separation may have various advantages. For example, 
it may simplify the overall complexity of temporal representation. It may also 
facilitate the convergence of learning. ® 

In all, SSS learns to use both global and local contexts through seeking out 
proper configurations of temporal structures (global and local dependencies). 
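The first case above (turn left at the first intersection, right at the second) can be made concrete with a small sketch. A single memoryless mapping from observation to action must give the same answer at both intersections, whereas a chain of two memoryless modules can give different answers because the position in the chain supplies the context. The observation names, the two policy tables, and the hand-coded switch point are illustrative assumptions; in SSS the switch points would be learned rather than specified.

```python
def memoryless_policy(obs):
    """One fixed observation-to-action mapping: it cannot tell the first
    intersection from the second."""
    return {"corridor": "forward", "intersection": "left"}[obs]


# Two memoryless modules, one per subsequence.
MODULES = [
    {"corridor": "forward", "intersection": "left"},    # first segment
    {"corridor": "forward", "intersection": "right"},   # second segment
]


def chained_policy(observations):
    """Use the modules in order, switching right after each intersection."""
    k, actions = 0, []
    for obs in observations:
        actions.append(MODULES[k][obs])
        if obs == "intersection" and k < len(MODULES) - 1:
            k += 1                      # hand control to the next module
    return actions


route = ["corridor", "intersection", "corridor", "intersection"]
flat = [memoryless_policy(o) for o in route]
chained = chained_policy(route)
```

Here `flat` turns left twice, while `chained` turns left and then right, although each module on its own uses no memory.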

We can characterize different dimensions of non-Markovian temporal depen- 
dencies as follows: 

— Degree of dependency: the maximum number of previous states that the 
current step is dependent on. In one extreme, the degree can be unlimited 
or infinite. In the other extreme, the degree can be zero, in which case there 
is no temporal dependency. 

— Distance of dependency: the maximum number of steps between the cur- 
rent step and the furthest previous state that the current step is dependent 
on (given all the intermediate steps). In one extreme, the distance can be 
unlimited or infinite. In the other extreme, the distance can be zero. 

— Density of dependency: the ratio between the degree of dependency and the 
distance of dependency. 
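These three dimensions can be computed mechanically once the dependencies of a step are known. The following is a minimal sketch, assuming (for illustration only) that the dependencies of one step are given as a set of lags, where lag 1 means the immediately preceding state. The function name is hypothetical, and reporting a density of 0 when there is no dependency is a convention chosen here, since the ratio is undefined in that case. Taking the maximum over all steps of a sequence would give the sequence-level degree and distance.

```python
def dependency_profile(lags):
    """Return (degree, distance, density) for one step.

    lags: positive offsets of the previous states that the current step
    depends on (1 = the immediately preceding state).
    """
    lags = set(lags)
    degree = len(lags)                    # how many previous states matter
    distance = max(lags, default=0)       # how far back the furthest one is
    density = degree / distance if distance else 0.0
    return degree, distance, density


dense_local = dependency_profile({1, 2, 3})    # short distance, high density
sporadic_long = dependency_profile({50})       # long distance, low density
```

In these terms, a sporadic long-range dependency shows up as a large distance with a density close to zero.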

According to these dimensions, (1) sporadic long-range dependencies involve
long distance but low density temporal dependencies. Thus they can be hand- 
led by Markovian subsequencing and non-Markovian meta-sequencing, because 
such dependencies can be captured as segmentation points and handled by tem- 
poral representation at the higher level, which makes the lower level Marko- 
vian (because non-Markovian dependencies are of long distance and low density, 
they need not show up in lower-level processes). (Note that it is also possible 

® Local non-Markovian processes can be different from each other, and thus the 
segmentation per se captures certain (global) non-Markovian dependencies. So it is 
also possible to use atemporal representation at the higher level if such representation 
is sufficient to handle existent global non-Markovian dependencies (as in the first 
case).

that Markovian subsequencing and non-Markovian meta-sequencing can handle
a non-Markovian task with long distance, high density, and high degree temporal 
dependencies, if subsequencing can insulate lower-level processes from such de- 
pendencies, as in the case of e.g. the parity problem when presented sequentially.) 
(2) Using Markovian subsequencing (without non-Markovian meta-sequencing) 
to turn a non-Markovian task into a Markovian task is a special case in which 
\ow-density (long- or short- distance) dependencies disappear when the task is 
properly segmented, due to, on the one hand, the localization of the lower-level 
processes and, on the other hand, the insulation of most of the states from the 
higher-level process. These processes likely become Markovian because the low- 
density temporal dependencies are likely absorbed by the chain of Markovian 
processes. (3) Using non-Markovian subsequencing to turn non-Markovian tasks 
into Markovian ones typically involves temporal dependencies that are of short 
distance (but maybe of high density and of high degree) and can be localized 
within the lower-level processes. (4) Separating the handling of coexisting global 
and local non-Markovian dependencies typically involves cases in which temporal 
dependencies are of varied distances that are preferably bifurcated (i.e., natu- 
rally separated into two categories: long-distance and short- distance), so that 
each type can be handled by a corresponding level. 



4.2 Analysis of Difficulties 

The major difficulties SSS faces are the following: 

— Large search space. The simultaneous learning of multiple levels (involving 
abstract control, control, and individual action modules) results in a much 
larger search space in which the system has to find an optimal (or near opti- 
mal) configuration. In comparison, structurally pre-determined hierarchical 
RL has much smaller spaces to deal with. 

— Cross-level interactions. Between any two adjacent levels, the higher-level
control interacts with the lower-level control. In other words, the learning
of the lower-level control is based on the higher-level control, but in turn
the higher-level control also needs to adapt based on the learning and
performance of the lower-level control. Thus, there are complex dynamic
interactions (section 3.1). Structurally pre-determined hierarchies do not deal
with such problems.

— Within-level interactions. Interactions can occur among different modules 
at the same level too. Due to learning, the performance of one module can 
change and therefore affect other modules. While these other modules adapt 
to the change through learning, their outcomes in turn affect the original 
module. Again, there are complex, dynamic interactions. 

There are in general three possible training regimes (all of which can be used
in our model), ordered from the computationally simplest, which most reduces
the interactions between different components of a hierarchical system, to the
most complex, as follows:

— Incremental elemental-to-composite learning (Lin 1993, Singh 1994): first
train the system using only “elemental” tasks (with each elemental task
being mapped to a particular module) and then, after the completion of the
training on elemental tasks, train the system using composite tasks (which
are composed of elemental tasks). There is in effect no autonomous
self-segmentation in this case. Segmentation is pre-determined. This is therefore
the easiest setting.

— Simultaneous elemental-and-composite learning (Singh 1994, Tham 1994): 
train the whole system using both elemental and composite tasks (each as- 
sociated with a task label) simultaneously. In this case, the elemental tasks, 
learned along with composite tasks, provide the clues for segmenting the 
composite tasks. So the segmentation in this case is not autonomous either. 

— Simultaneous composite learning: train all the components of the system si- 
multaneously using the same composite tasks with the same reinforcement 
information from the environment; in other words, the system learns to de- 
compose the task autonomously by itself. This is the most difficult setting 
and the focus of this work. 

Thus, although SSS can accommodate any of these methods, it does not require 
separate training for different modules, incremental training that goes from ele- 
mental to composite tasks, or simultaneous training of composite and elemental 
tasks (pre-segmentation). SSS instead relies on reinforcement for completely au- 
tonomous self-segmentation: It compares different amounts of reinforcement re- 
sulting from different ways of segmentation and tends to choose the best way(s) 
on that basis. 

5 Experiments 

We will discuss two examples below. The first domain is taken from McCallum
(1996b). It is chosen because it is simple but involves multiple possible segments,
which is useful in illustrating our approach. The second domain is taken from 
Tadepalli and Dietterich (1997), which demonstrates the reuse of segments. In 
each domain, we will show that proper configurations of modules can be learned. 
For simplicity, we will use a two-level hierarchy. 



Maze 1 In this maze, there are two possible starting locations and one goal 
location. The agent occupies one cell at a time and at each step obtains local in- 
formation (observation) concerning the four adjacent cells (the left, right, above, 
and below cell) regarding whether each is an opening or a wall. It can make a 
move to any adjacent cell at each step (either go left, go right, go up, or go 
down). But if the adjacent cell is a wall, the agent will remain in the original cell 
at the end of the step. See Figure 1, where “1” indicates the starting cells. Each 
number indicates an observational state (i.e., an observation, not a true state) 
as perceived by the agent. In this domain, the minimum path length (i.e., the 
number of steps of the optimal path from either starting cell to the goal cell) is
6. When not all the paths are optimal, the average path length is a measure of
the overall quality of a set of paths. The reinforcement structure (for applying 
Q-learning) is simple: The reward for reaching the goal location is 1, and no 
other reward (or punishment) is given. 

The parameter settings for Q-learning modules are as follows: the Q value
discount rate is 0.95, the initial learning rates for all modules ($\alpha_Q$,
$\alpha_{CQ}$, $\alpha_{AQ}$) are uniformly 0.9, the learning rates change
according to $\alpha^{*} =$ [formula missing in the source] (where $t$ is the
number of episodes completed), the initial temperatures are $\tau_Q = 0.5$
and $\tau_{CQ} = \tau_{AQ} = 0.9$, and the temperatures change according to
$\tau^{*} =$ [formula missing in the source] (where $t$ is the number of
episodes completed). The total number of training episodes is 5,000. The number
of steps allowed in each episode during learning is 100 (an episode ends when
the limit is reached, or as soon as the goal is reached).
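
The roles of the discount rate, learning rate, and temperature can be illustrated with a minimal sketch. It assumes standard one-step tabular Q-learning and Boltzmann (softmax) action selection; it is not the SSS-specific update for the control modules, and the state labels and function names are hypothetical.

```python
import math

def boltzmann_probs(q_values, temperature):
    """Action probabilities proportional to exp(Q / temperature)."""
    top = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp((q - top) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

def q_update(q, state, action, reward, next_state, alpha, gamma=0.95):
    """One-step Q-learning update on a table q[state][action] (in place)."""
    target = reward + gamma * max(q[next_state])
    q[state][action] += alpha * (target - q[state][action])

# Two hypothetical observational states with four moves each.
q = {"s": [0.0, 0.0, 0.0, 0.0], "goal": [0.0, 0.0, 0.0, 0.0]}
q_update(q, "s", 2, reward=1.0, next_state="goal", alpha=0.9)
print(q["s"])                                    # [0.0, 0.0, 0.9, 0.0]
print(boltzmann_probs(q["s"], temperature=0.5))  # action 2 is now favored
```

A high temperature keeps the action probabilities close to uniform (exploration), whereas a low temperature, such as the 0.006 used after learning, makes the selection nearly greedy.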

As shown in Figure 1, the three cells marked as “5” are perceived to be the 
same by the agent. However, different actions are required in these cells in order 
to obtain the shortest paths to the goal. Likewise, the two cells marked as “10” 
are perceived to be the same but require different actions. In order to remove 
non-Markovian dependencies in each module, we can divide up each possible 
sequence (from one of the starting cells to the goal) into a number of segments 
so that, when different modules are used for these segments, a consistent policy 
can be adopted in each module.

Fig. 1. A maze requiring segmentation. Each number (randomly chosen) indicates a
unique observational state as perceived by the agent.




Fig. 2. Different segmentations: (a) using two modules, (b) using three modules, (c)
using four modules. Each arrowed line indicates an action (sub)sequence produced by
a module, learned by SSS.

As confirmed by the results of our experiments, a single agent (with atemporal
representation) could not learn the task at all (the learning curve was
flat), due to oscillations (and averaging) of Q values. Adding temporal repre- 
sentation (i.e., memory) may partially remedy the problem, but this approach 
has difficulty dealing with long-range dependencies (see earlier discussions) and, 
moreover, it is not comparable with SSS using atemporal representation. Thus it 
was not used here. On the other hand, as confirmed by the experiments, SSS can 
segment sequences to remove non-Markovian dependencies. In this domain, it 
can be easily verified that a minimum of two Q/CQ pairs (modules) are needed 
in order to remove non-Markovian dependencies in each module. See Figure 2. 
However, finding the optimal paths using only two modules proved difficult,
because the agent has to be able to find switching points (between the two
modules) that are exactly right for a sequence. In general, in this domain, the more
modules there are, the easier the segmentation can be done (that is, the faster the 
learning is). This is because the more modules there are, the more possibilities 
(alternatives) there are for a proper segmentation that removes non-Markovian 
dependencies. An ANOVA (number of modules × block of training)
shows that there is a significant main effect of number of modules (p < 0.05)
and a significant interaction between number of modules and block of training
(p < 0.05), which indicates that the number of modules has a significant impact
on the learning performance. The performance of the resulting systems is shown
in Figure 3, under two conditions: using the completely deterministic policy in
each module, or using a stochastic policy (with Boltzmann-distribution action
selection with $\tau = 0.006$). In either case, as indicated by Figure 3, there
are significant performance improvements in terms of percentages of the optimal
path traversed and in terms of the average path length, when we go incrementally
from 2 modules to 14 modules. Pairwise t tests on successive average path
lengths confirmed this conclusion (p < 0.05). See Figure 3 for the test results.

As a variation, to reduce the complexity of this domain and thus to facilitate 
learning, we limit the starting location to one of the two cells. This setting is 
easier because there is no need to be concerned with the other half of the maze 
(see Figure 4). The learning performance is indeed enhanced by this change. 
We found that this change leads to better learning performance using either
two, three, or four modules. Another possibility for reducing the complexity is
to always start with one module. Again, the learning performance is enhanced
by this change. Therefore, the segmentation performance is determined to a 
large extent by the inherent difficulty of the domain that we deal with. The 
performance changes gradually in relation to the change in the complexity of 
the domain. 

In all, the above experiments demonstrated that proper segmentation was
indeed possible using SSS. Through learning a set of Markovian processes (with
atemporal representations at both the higher and lower level), SSS removed
non-Markovian dependencies and handled the sequences using a (learned) chain of
Markovian processes. It was also clear from the experiments that it was not
guaranteed that the segmentation (learning) process would succeed. The
convergence to optimal solutions was probabilistic (as illustrated by Figure 3).

Footnote: When there are more modules available, there are more possibilities of
segmentation. For example, some of the possible ways of segmentation, using three
and four modules, are shown in Figure 2.

Fig. 3. The effect of number of modules on the performance (after learning), in terms
of the percentage of optimal paths and the average path length (with the maximum
length of 24 imposed on each path).




Fig. 4. A possible segmentation using two modules, when only one starting location is 
allowed. 



Maze 2 In this maze, there is one starting location at one end (marked as “2”) 
and one goal location at the other. However, before reaching the goal location, 
the agent has to reach the top of each of the three arms. At each step, the agent 
obtains local observation concerning the four adjacent cells regarding whether 
each is an opening or a wall. See Figure 5, where each number indicates an 
observational state (an observation) as perceived by the agent. The agent can 
make a move to any adjacent cell at each step (either go left, go right, go up, 
or go down). But if the adjacent cell is a wall, the agent will remain in the 
original cell at the end of the step. The reward for reaching the goal location 
(after visiting the tops of all the three arms) is 1. 

The parameter settings for the modules are as follows: the Q value discount
rate is 0.97, the initial learning rates for all the modules ($\alpha_Q$,
$\alpha_{CQ}$, $\alpha_{AQ}$) are uniformly 0.9, the learning rates change
according to $\alpha^{*} =$ [formula missing in the source] (where $t$ is
the number of episodes completed), the initial temperatures are $\tau_Q = 0.6$ and
$\tau_{CQ} = \tau_{AQ} = 0.8$, and the temperatures change according to
$\tau^{*} =$ [formula missing in the source].
The total number of training episodes is 10,000. The number of steps allowed in
each episode during learning is 200 (an episode ends when the limit is reached,
or as soon as the goal is reached after having reached the tops of all the three
arms).

Fig. 5. Maze 2: A maze requiring segmentation and reuse of modules. Each number
(randomly chosen) indicates an observational state as perceived by the agent.

Fig. 6. A segmentation using only two modules. Each arrowed line indicates an action
sequence produced by a module, learned by SSS. The two modules alternate.



In this domain, the shortest path consists of 19 steps. A minimum of two 
modules (two Q/CQ pairs) are needed (using atemporal input representation), 
in order to remove non-Markovian dependencies through segmentation and thus 
to obtain the shortest path. (As confirmed by the results of our experiments, a 
single agent with atemporal representation cannot learn the task at all.) With 
two modules, there is exactly one way of segmenting the sequence (considering
only the shortest path to the goal): switching to a different module at the top cell
of each arm (marked as “4”) and switching again at the middle cell between any
two arms (marked as “10”). See Figure 6. As confirmed by our experiments, this
segmentation allows repeated use of the same modules along the way to the goal.
Reuse of modules leads to the compression of sequences. When more modules are
added, reuse of modules might be reduced: a third module, for example, can be 
used in place of the second use of module 1 (Figure 6). 

As shown by the experiments, in this domain, unfortunately, learning does
not become easier when we add more modules. This is because, in this domain,
there is exactly one way of segmentation (that is, switching at the top
cell of each arm and at the cell between any two arms) that may lead to an
optimal path, and thus there is no advantage in using more modules. Instead,
we observed in the experiments a slight (but statistically significant) decrease of
performance when more modules are added, in terms of both the learning process 
and the learned product. In general, the more modules we have, the slower 
the learning is. An ANOVA (number of modules × block of training)
shows that there is a significant main effect of number of modules (p < 0.05)
and a significant interaction between number of modules and block of training
(p < 0.05), which indicates that the number of modules has a significant negative
(detrimental) impact on learning. After learning, in terms of the performance
of the resulting systems, we looked into both percentages of the optimal path
(under both the deterministic and stochastic condition) and average path lengths
(under both conditions). See Figure 7. The t test shows that the increases in
average path lengths (i.e., the degradation of performance), corresponding to
the increases from 2 to 3 modules and from 3 to 4 modules, are statistically
significant.

Fig. 7. The effect of number of modules on the performance (after learning), in terms
of the percentage of optimal paths and the average path length (with a maximum
length of 39).



In all, the experiments in this domain further demonstrated the point made 
earlier regarding the feasibility of self-segmentation. Non-Markovian dependen- 
cies were removed through a set of Markovian processes (with atemporal repre- 
sentations). Furthermore, the experiments demonstrated the reuse of a module 
(in several different places) in a sequence or, in other words, the possibility of 
subroutines that could be called into use any number of times. 

Reuse of modules has significant advantages. It leads to the compression of 
descriptions of sequences. Moreover, it allows the handling of sequences that 
cannot be (efficiently) handled otherwise. For example, the above domain could 
not be handled by simpler models in which segmentation was limited to a linear 
chaining (of a small number of modules) without any way of reusing modules
(such as Wiering and Schmidhuber 1998). However, a simpler model might
handle the domain by using six or more modules so that a chain of six modules
could be formed. This approach resulted in inefficiency. While a minimum of two
modules was required for SSS in this domain, Wiering and Schmidhuber’s model
required at least 6 modules in order to obtain the optimal path. If we increased
the number of arms to, say, 10, then at least 20 modules would be required by
Wiering and Schmidhuber’s model, while two modules would still suffice with
SSS.
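
The module counts quoted above can be reproduced with a small sketch. It assumes, as the 3-arm and 10-arm figures in the text imply, that the shortest path splits into two segments per arm; the function names are hypothetical and the module numbering is arbitrary.

```python
def segment_count(arms):
    """Segments on the shortest path: two per arm (assumed from the text's counts)."""
    return 2 * arms

def chain_assignment(arms):
    """Linear chaining without reuse: every segment needs a fresh module."""
    return list(range(1, segment_count(arms) + 1))

def reuse_assignment(arms):
    """Alternating reuse: two modules take turns along the path."""
    return [1 + (i % 2) for i in range(segment_count(arms))]

for arms in (3, 10):
    chain, reuse = chain_assignment(arms), reuse_assignment(arms)
    print(arms, len(set(chain)), len(set(reuse)))
# prints "3 6 2" then "10 20 2"
```

The number of distinct modules grows linearly with the number of arms under linear chaining, but stays at two when modules can be reused.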



Note that we did not experimentally compare SSS with e.g. Precup et al 
(1998), Dietterich (1997), and Parr and Russell (1997), because these algorithms 
require a substantial amount of a priori domain knowledge and thus are not 
comparable to SSS. On the other hand, although Parr and Russell (1995) and 
McCallum (1996b) deal with partial observability without requiring a priori kno- 
wledge, they do not segment sequences and develop hierarchies of subsequences, 
and thus are not comparable to SSS in that sense. 



6 Summary 

The advantages of the algorithm can be summarized as follows: 

— In SSS, there are no pre-determined hierarchies, no a priori domain-specific
structures to help with the formation of hierarchies, no domain-specific built- 
in ways for creating hierarchies, and so on. In sum, there is no use of a priori 
domain-specific knowledge in forming hierarchies. 

— SSS uses a single learning principle (that is, the maximization of
expected reinforcement).

— SSS seeks out somewhat proper (and sometimes optimal) configurations of 
hierarchical structures, in correspondence with temporal dependency struc- 
tures of a domain (extending Wiering and Schmidhuber 1998). 

— SSS is able to develop subroutines, which can potentially be reused at diffe- 
rent points of a sequence (unlike Wiering and Schmidhuber 1998), or shared 
by multiple (different) sequences (as in Thrun and Schwartz 1995). 

— In terms of classical planning, with hierarchical structuring, SSS can be 
viewed as performing planning at different levels of abstraction (Knoblock 
et al 1994), with lower-level modules being macro-actions. 

There is much further work that can be done along the line outlined here. For 
example, we need to scale the algorithm up to deal with much larger state spa- 
ces, in which function approximation or other schemes must be used to handle 
the complexity of Q values, and the speed of convergence must also be expressly 
dealt with. We need to look into domains with probabilistic state transitions 
and/or noisy observations. We may also want to adaptively determine the num- 
ber of levels in such systems through learning, in the same way as the adaptive 
determination of segmentation points within each level.



1 Introduction

Problem formulation is often an important first step for solving a problem
effectively. In sequential decision problems, the Markov decision process (MDP)
(Bellman [2]; Puterman [22]) is a model formulation that has been commonly used,
due to its generality, flexibility, and applicability to a wide range of problems. Despite
these advantages, there are three necessary conditions that must be satisfied 
before the MDP model can be applied; that is, 

1. The environment model is given in advance (a completely-known environ- 
ment). 

2. The environment states are completely observable (fully-observable states, 
implying a Markovian environment). 

3. The environment parameters do not change over time (a stationary environ- 
ment). 

These prerequisites, however, limit the usefulness of MDPs. In the past, re- 
search efforts have been made towards relaxing the first two conditions, leading 
to different classes of problems as illustrated in Figure 1.

Fig. 1. Categorization into four related problems with different conditions. Note that
the degree of difficulty increases from left to right and from top to bottom.

This paper mainly addresses the first and third conditions, whereas the second
condition is only briefly discussed. In particular, we are interested in a
special type of nonstationary environments that repeat their dynamics in a cer- 
tain manner. We propose a formal model for such environments. We also develop 
algorithms for learning the model parameters and for computing optimal poli- 
cies. 

Before we proceed, let us briefly review the four categories of problems shown 
in Figure 1 and define the terminology that will be used in this paper. 

1.1 Four Problem Types

Markov Decision Process

MDP is the central framework for all the problems we discuss in this section.
An MDP formulates the interaction between an agent and its environment. The
environment consists of a state space, an action space, a probabilistic state
transition function, and a probabilistic reward function. The goal of the agent is to
find, according to its optimality criterion, a mapping from states to actions (i.e.,
a policy) that maximizes the long-term accumulated rewards. This policy is called
an optimal policy. In the past, several methods for solving Markov decision
problems have been developed, such as value iteration and policy iteration (Bellman
1).
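
As an illustration of the first of these methods, here is a minimal value iteration sketch for a small, fully known MDP. It assumes a discounted-reward criterion; the data layout (`P[s][a]` as a list of (probability, next state) pairs, `R[s][a]` as the expected immediate reward) and the toy problem are assumptions of the sketch, not taken from the text.

```python
def value_iteration(P, R, gamma=0.9, tol=1e-9):
    """Return (values, greedy policy) for a finite, fully known MDP.

    P[s][a] is a list of (probability, next_state) pairs and R[s][a] is the
    expected immediate reward; both layouts are assumptions of this sketch.
    """
    def backup(s, a, V):
        return R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])

    V = [0.0] * len(P)
    while True:
        delta = 0.0
        for s in range(len(P)):
            best = max(backup(s, a, V) for a in range(len(P[s])))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:  # stop once a full sweep changes no value noticeably
            break
    policy = [max(range(len(P[s])), key=lambda a, s=s: backup(s, a, V))
              for s in range(len(P))]
    return V, policy

# Toy MDP: in state 0, action 0 stays put (reward 0) and action 1 moves to
# the absorbing state 1 (reward 1). State 1 has a single zero-reward action.
P = [[[(1.0, 0)], [(1.0, 1)]], [[(1.0, 1)]]]
R = [[0.0, 1.0], [0.0]]
V, policy = value_iteration(P, R)
print(V, policy)   # [1.0, 0.0] [1, 0]
```

The returned policy is the mapping from states to actions described above: in state 0 it selects the action that collects the reward.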

Reinforcement Learning 

Reinforcement learning (RL) (Kaelbling et al. 12; Sutton and Barto 28) was
originally concerned with learning to perform a sequential decision task based
only on scalar feedback, without any knowledge about what the correct
actions should be. Around a decade ago researchers realized that RL problems
could naturally be formulated as incompletely known MDPs. This realization
is important because it enables one to apply existing MDP algorithms to RL
problems. This has led to research on model-based RL. The model-based RL
approach first reconstructs the environment model by collecting experience from
its interaction with the world, and then applies conventional MDP methods to
find a solution. In contrast, model-free RL learns an optimal policy directly
from the experience. It is this second approach that accounts for the major
difference between RL and MDP algorithms. Since less information is available,
RL problems are in general more difficult than the MDP ones.
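
The model-reconstruction step of model-based RL can be sketched as follows, assuming fully observable states and simple maximum-likelihood (frequency count) estimates. The tuple format and the name `estimate_model` are hypothetical.

```python
from collections import Counter, defaultdict

def estimate_model(experience):
    """Frequency-count estimates of transitions and rewards from
    (state, action, reward, next_state) tuples."""
    counts = defaultdict(Counter)
    reward_sum = defaultdict(float)
    for s, a, r, s2 in experience:
        counts[(s, a)][s2] += 1
        reward_sum[(s, a)] += r
    T, R = {}, {}
    for sa, nexts in counts.items():
        n = sum(nexts.values())
        T[sa] = {s2: c / n for s2, c in nexts.items()}  # estimated P(s2 | s, a)
        R[sa] = reward_sum[sa] / n                      # estimated mean reward
    return T, R

experience = [(0, "go", 0.0, 1), (0, "go", 0.0, 1),
              (0, "go", 1.0, 2), (1, "go", 1.0, 2)]
T, R = estimate_model(experience)
print(T[(0, "go")], R[(0, "go")])
```

Once `T` and `R` have been estimated, a conventional MDP method such as value iteration can be applied to them; a model-free method would instead update value estimates directly from each tuple.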



Partially Observable Markov Decision Process 

The assumption of having fully-observable states is sometimes impractical in
the real world. Inaccurate sensory devices, for example, could make this
condition difficult to satisfy. This concern leads to studies on extending MDP
to partially-observable MDP (POMDP) (Monahan 20; Lovejoy 17; White III
29). A POMDP basically introduces two additional components to the original
MDP, i.e., an observation space and an observation probability function.
Observations are generated based on the current state and the previous action, and
are governed by the observation function. The agent is only able to perceive
observations, but not states themselves. As a result, past observations become
relevant to the agent’s choice of actions. Hence, POMDPs are sometimes referred
to as non-Markovian MDPs. Traditional approaches to POMDPs (Sondik 26;
Cheng 5; Littman et al. 15; Cassandra et al. 4; Zhang et al. 31) maintain a
probability distribution over the states, called the belief state. This essentially
transforms the problem into an MDP with an augmented (and continuous) state space.
Unfortunately, solving POMDP problems exactly is known to be intractable in
general (Papadimitriou and Tsitsiklis 21; Littman et al. 14).
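
The belief state is maintained by a Bayesian update after each action and observation. The sketch below shows the standard update, with the observation depending on the new state and the previous action as described above. The array layouts (`T[a][s][s2]`, `O[a][s2][o]`) and the numbers in the example are assumptions made for illustration.

```python
def belief_update(belief, action, observation, T, O):
    """Return the new belief after taking `action` and seeing `observation`.

    T[a][s][s2] = P(s2 | s, a); O[a][s2][o] = P(o | s2, a) (assumed layouts).
    """
    n = len(belief)
    unnormalized = [
        O[action][s2][observation]
        * sum(T[action][s][s2] * belief[s] for s in range(n))
        for s2 in range(n)
    ]
    total = sum(unnormalized)
    if total == 0:
        raise ValueError("observation has zero probability under this belief")
    return [u / total for u in unnormalized]

# One action that leaves the state unchanged, and a noisy two-valued sensor.
T = [[[1.0, 0.0], [0.0, 1.0]]]
O = [[[0.8, 0.2], [0.3, 0.7]]]
new_belief = belief_update([0.5, 0.5], action=0, observation=0, T=T, O=O)
print(new_belief)   # probability mass shifts toward state 0
```

Because the belief is a vector of probabilities, the resulting MDP has a continuous state space even when the underlying state space is finite.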

Hidden-State Reinforcement Learning 

Recently, research has been conducted on the case where the environment is both
incompletely known and partially observable. This type of problem is sometimes
referred to as hidden-state reinforcement learning, incomplete perception,
perception aliasing, or non-Markovian reinforcement learning. Hidden-state RL
algorithms can also be classified into model-based and model-free approaches.
For the former, a variant of the Baum-Welch algorithm (Chrisman 8) is typically
used for model reconstruction, which turns the problem into a conventional
POMDP. Optimal policies can then be computed by using existing POMDP
algorithms. For the latter, research efforts are diverse, ranging from state-free
stochastic policy (Jaakkola et al. 11), to recurrent Q-learning (Schmidhuber 24;
Lin and Mitchell 13), to finite-history-window approach (McCallum 19; Lin and
Mitchell 13). Nevertheless, most of the model-free POMDP algorithms yield
only sub-optimal solutions. Among the four classes of problems mentioned above,
hidden-state RL problems are expected to be the most difficult.

1.2 Nonstationary Environments 

Traditional MDP problems typically assume that environment dynamics (i.e., 
MDP parameters) are always fixed (i.e., stationary). This assumption, howe- 
ver, is not realistic in many real-world applications. In elevator control (Crites 
and Barto 9), for example, the passenger arrival and departure rates can vary 
significantly over one day, and should not be modeled by a fixed MDP. 

Previous studies on nonstationary MDPs (Puterman 22) presume that chan- 
ges of the MDP parameters are exactly known in every time step. Given this 
assumption, solving nonstationary MDP problems is trivial, as the problem can 
be recast into a stationary one (with a much larger state space) by performing 
state augmentation. Nevertheless, extending the idea to incompletely-known en- 
vironmental changes (i.e., to the reinforcement learning framework) is far more 
difficult. 

In fact, RL (Kaelbling et al. 12; Sutton and Barto 28) in nonstationary
environments is an impossible task if there exists no regularity in the way
environment dynamics change. Hence, some degree of regularity must be assumed.
Typically, nonstationary environments are presumed to change slowly enough
that on-line RL algorithms can be employed to keep track of the changes.
The on-line approach is memoryless in the sense that even if the environment
ever reverts to the previously learned dynamics, learning must still start all over
again. There are a few heuristic approaches along this line (Littman and Ackley
16; Sutton 27; Sutton and Barto 28).



1.3 The Properties of Our Proposed Model 

Herein we propose a formal environment model (Choi et al. 7) for the nonstatio- 
nary environments that repeat their dynamics over time. Our model is inspired 
by observations from an interesting class of nonstationary RL tasks. Throughout 
this section we illustrate the properties of such nonstationary environments by 
using the elevator control problem as an example. 



Property 1: A Finite Number of Environment Modes 

The first property we observed is that environmental changes are confined to a 
finite number of environment modes. Modes are stationary environments that 
possess distinct environment dynamics and require different control policies. At 
any time instant, the environment is assumed to be in exactly one of these 
modes. This concept of modes seems to be applicable to many, though not all, 
real-world tasks. In the elevator control problem, a system might operate in 
a morning-rush-hour mode, an evening-rush-hour mode and a non-rush-hour 
mode. One can also imagine similar modes for other real-world control tasks, 
such as traffic control, dynamic channel allocation (Singh and Bertsekas 25), 
and network routing (Boyan and Littman 3). 



Property 2: Partially Observable Modes 

Unlike states, environment modes cannot be directly observed. Instead, the cur- 
rent mode can only be estimated according to the past state transitions. It is 
analogous to the elevator control example in that the passenger arrival rate and 
pattern can only be partially observed through the occurrence of pick-up and 
drop-off requests. 



Property 3: Modes Evolving as a Markov Process 

Normally, mode transitions are stochastic events and are independent of the
control system’s response. In the elevator control problem, the events that change
the current mode of the environment could be an emergency meeting in the
administrative office, or a tea break for the staff on the 10th floor. Obviously,
the elevator’s response has no control over the occurrence of these events.



Property 4: Infrequent Mode Transitions

Mode transitions are relatively infrequent. In other words, a mode is more likely
to persist for some time before switching to another one. Take the emergency
meeting as an example: employees on different floors take time to arrive at the
administrative office, and thus would generate a similar traffic pattern (drop-off
requests on the same floor) for some period of time.



Property 5: Small Number of Modes 

It is common that, in many real-world applications, the number of modes is
much smaller than the number of states. In the elevator control example, the state
space comprises all possible combinations of elevator positions, pick-up and
drop-off requests, and certainly would be huge. On the other hand, the mode
space could be small. For instance, an elevator control system can simply have
the three modes as described above to approximate the reality.

Based on these properties, an environment model is now proposed. The whole
idea is to introduce a mode variable to capture environmental changes. Each
mode specifies an MDP and hence completely determines the current state
transition function and reward function (property 1). A mode, however, is not
directly observable (property 2), and evolves with time according to a Markov process
(property 3). The model is therefore called the hidden-mode model.
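
The structure just described can be sketched as a small data type. This is a minimal illustration under stated assumptions rather than the formal definition: the field names, the array layouts, and the choice to sample the next state from the current mode before the mode itself moves are all assumptions of the sketch.

```python
import random
from dataclasses import dataclass

@dataclass
class HiddenModeModel:
    mode_trans: list   # mode_trans[m][n] = P(next mode n | mode m)
    state_trans: list  # state_trans[m][s][a][s2] = P(s2 | s, a) in mode m
    reward: list       # reward[m][s][a] = reward in mode m

    def step(self, mode, state, action, rng=random):
        """Sample (next_mode, next_state, reward); the agent never sees the mode."""
        dist = self.state_trans[mode][state][action]
        next_state = rng.choices(range(len(dist)), weights=dist)[0]
        r = self.reward[mode][state][action]
        # The mode follows its own Markov chain, independently of the action
        # (property 3). Doing this after the state move is an assumed ordering.
        row = self.mode_trans[mode]
        next_mode = rng.choices(range(len(row)), weights=row)[0]
        return next_mode, next_state, r

# Two modes, two states, one action, with deterministic dynamics for clarity:
# mode 0 keeps the state (reward 0), mode 1 flips it (reward 1).
model = HiddenModeModel(
    mode_trans=[[0.0, 1.0], [1.0, 0.0]],
    state_trans=[[[[1.0, 0.0]], [[0.0, 1.0]]],
                 [[[0.0, 1.0]], [[1.0, 0.0]]]],
    reward=[[[0.0], [0.0]], [[1.0], [1.0]]],
)
print(model.step(mode=0, state=0, action=0))   # (1, 0, 0.0)
print(model.step(mode=1, state=0, action=0))   # (0, 1, 1.0)
```

Each mode indexes a complete MDP (its own transition and reward tables), while the mode transition table is shared and does not depend on the action.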

Note that the hidden-mode model does not impose any constraint to satisfy 
properties 4 and 5. In other words, the model is flexible enough to work for 
environments where these two properties do not hold. Nevertheless, as will be 
shown later, these properties can be utilized to help the learning in practice. 

The hidden-mode model also has its limitations. For instance, one may argue 
that the mode of an environment should preferably be continuous. While this 
is true, for tractability, we assume the mode is discrete. This implies that our 
model, as for any other model, is only an abstraction of the real world. Moreover, 
we assume that the number of modes is known in advance. We will seek to relax 
these assumptions in future research. 



1.4 Related Work 

Our hidden-mode model is closely related to the nonstationary environment
model proposed by Dayan and Sejnowski (10). Although our model is more
restrictive in terms of representational power, it involves far fewer parameters
and is thus easier to learn. Moreover, apart from the number of possible modes, which
must be known in advance, we do not assume any other knowledge about the
way environment dynamics change^. Dayan and Sejnowski, on the other hand,
assume that one knows precisely how the environment dynamics change.

^ That is, the transition probabilities of the Markov process governing mode changes, 
though fixed, are unknown in advance. 
The hidden-mode model can also be viewed as a special case of the hidden-state
model, or partially observable Markov decision process (POMDP). As will
be shown later, a hidden-mode model can always be represented by a hidden-state
model through state augmentation. Nevertheless, modeling a hidden-mode
environment via a hidden-state model will unnecessarily increase the problem
complexity. We discuss the conversion from the former to the latter in Section 2.2.

1.5 Our Focus 

In order for RL to take place, one may choose between the model-based and
model-free approaches. This paper is primarily concerned with the model-based
approach, and concentrates on how a hidden-mode model can be learned based
on the Baum-Welch algorithm. The issue of finding the optimal policy will only
be addressed briefly.

1.6 Organization 

The rest of this paper is organized as follows. In the next section, we describe
the hidden-mode model by defining the hidden-mode Markov decision process
(HM-MDP) and illustrate how it can be reformulated into a POMDP. Section
3 will subsequently discuss how a hidden-mode model can be learned in two
different representations: a POMDP or an HM-MDP. A variant of the Baum-Welch
algorithm for learning HM-MDP is proposed. These two approaches are
then compared empirically in Section 4. In Section 5, we will briefly discuss how
hidden-mode problems can be solved. Then we highlight the assumptions of our
model and discuss its applicability in Section 6. Finally, Section 7 pinpoints some
directions for future research and Section 8 summarizes our research work.



2 Hidden-Mode Markov Decision Processes 

This section presents our hidden-mode model. Basically, a hidden-mode model is
defined as a finite set of MDPs that share the same state space and action space,
with possibly different transition functions and reward functions. The MDPs
correspond to different modes in which a system operates. States are completely
observable and their transitions are governed by an MDP. In contrast, modes are
not directly observable and their transitions are controlled by a Markov chain.
We refer to such a process as a hidden-mode Markov decision process (HM-MDP).
Figure 2 gives an example of an HM-MDP.

2.1 Formulation 

Fig. 2. A 3-mode, 4-state, 1-action HM-MDP. The values $x_{mn}$ and $y_m(s,a,s')$ are the
mode and state transition probabilities respectively.

Formally, an HM-MDP is defined as an 8-tuple $(Q, S, A, X, Y, R, \Pi, \Psi)$, where $Q$,
$S$ and $A$ represent the sets of modes, states and actions respectively; the mode
transition function $X$ maps mode $m$ to $n$ with a fixed probability of $x_{mn}$; the
state transition function $Y$ defines the transition probability, $y_m(s,a,s')$, from state
$s$ to $s'$ given mode $m$ and action $a$; the stochastic reward function $R$ returns
rewards with the mean value $r_m(s,a)$; $\Pi$ and $\Psi$ denote the prior probabilities of
the modes and of the states respectively. The evolution of modes and states is
depicted in Figure 3.
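As a rough illustration of the tuple just defined, the sketch below stores the five numeric components in nested Python lists and samples a trajectory. The class name `HMMDP`, the helper names, the index layout (`y[m][s][a][s2]`) and the toy numbers are all assumptions made for this example, not part of the original formulation. The update order (the current mode drives the state transition, then the mode itself moves) is one reading of the definition above.

```python
import random
from dataclasses import dataclass


@dataclass
class HMMDP:
    x: list    # x[m][n]: probability of moving from mode m to mode n
    y: list    # y[m][s][a][s2]: state transition probability under mode m
    r: list    # r[m][s][a]: mean reward under mode m
    pi: list   # pi[m]: prior probability of mode m
    psi: list  # psi[s]: prior probability of state s


def sample_index(probs, rng):
    """Draw an index according to a discrete probability distribution."""
    u, acc = rng.random(), 0.0
    for idx, p in enumerate(probs):
        acc += p
        if u < acc:
            return idx
    return len(probs) - 1  # guard against floating-point rounding


def simulate(model, actions, rng):
    """Return (hidden modes, observed states) for a given action sequence."""
    m = sample_index(model.pi, rng)
    s = sample_index(model.psi, rng)
    modes, states = [m], [s]
    for a in actions:
        s = sample_index(model.y[m][s][a], rng)  # state moves under the current mode
        m = sample_index(model.x[m], rng)        # mode then evolves as a Markov chain
        modes.append(m)
        states.append(s)
    return modes, states


# Toy 2-mode, 2-state, 1-action model (illustrative numbers only):
# mode 0 always drives the system to state 0, mode 1 always to state 1.
toy = HMMDP(
    x=[[0.9, 0.1], [0.2, 0.8]],
    y=[[[[1.0, 0.0]], [[1.0, 0.0]]],
       [[[0.0, 1.0]], [[0.0, 1.0]]]],
    r=[[[0.0], [0.0]], [[1.0], [1.0]]],
    pi=[0.5, 0.5],
    psi=[0.5, 0.5],
)
modes, states = simulate(toy, [0] * 20, random.Random(0))
```

In this toy model the observed state at each step reveals the mode that was active one step earlier, which is the simplest case of inferring a hidden mode from state transitions.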

2.2 Reformulating HM-MDP as POMDP 

HM-MDP is a subclass of POMDP. In other words, it is always possible to
reformulate the former as a special case of the latter. In particular, one may take
an ordered pair of any mode and observable state in the HM-MDP as a hidden
state in a POMDP, and any observable state of the former as an observation of
the latter. Suppose the observable states $s$ and $s'$ are in modes $m$ and $n$
respectively. These two HM-MDP states together with their corresponding modes form
two hidden states $(m, s)$ and $(n, s')$ for its POMDP counterpart. The transition
probability from $(m, s)$ to $(n, s')$ is then simply the mode transition probability
$x_{mn}$ multiplied by the state transition probability $y_m(s,a,s')$. For an $M$-mode,
$N$-state, $K$-action HM-MDP, the equivalent POMDP thus has $N$ observations
and $MN$ hidden states. A formal reformulation is detailed in Figure 4.
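The state augmentation described above can be sketched in a few lines. The function names, the flat index `i = m*N + s` for the hidden state `(m, s)`, and the toy probabilities are assumptions made for illustration; only the product rule for the transition probability comes from the text.

```python
def to_pomdp_transitions(x, y):
    """Build p[a][i][j] from x[m][n] and y[m][s][a][s2].

    Hidden state (m, s) is flattened to i = m*N + s (an indexing choice
    made for this sketch).
    """
    M, N, K = len(x), len(y[0]), len(y[0][0])
    p = [[[0.0] * (M * N) for _ in range(M * N)] for _ in range(K)]
    for a in range(K):
        for m in range(M):
            for s in range(N):
                for n in range(M):
                    for s2 in range(N):
                        # mode transition probability times state transition probability
                        p[a][m * N + s][n * N + s2] = x[m][n] * y[m][s][a][s2]
    return p


def observation_of(i, N):
    """Hidden state i = (m, s) deterministically emits its state component s."""
    return i % N


# Toy 2-mode, 2-state, 1-action model (illustrative numbers only)
x = [[0.9, 0.1], [0.2, 0.8]]
y = [[[[0.7, 0.3]], [[0.4, 0.6]]],
     [[[0.1, 0.9]], [[0.5, 0.5]]]]
p = to_pomdp_transitions(x, y)
row_sums = [sum(row) for row in p[0]]  # each row should sum to 1
```

The resulting matrix has M*N rows per action, which makes the growth in size relative to the HM-MDP parameters easy to see.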
Fig. 3. The evolution of an HM-MDP. Each node represents a mode, action or state
variable. The arcs indicate dependencies between the variables.
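$$
\begin{aligned}
S' &= Q \times S, \qquad Z' = S, \qquad A' = A \\
T' &: \{\, p_{ij}(a) = x_{mn} \cdot y_m(s,a,s') \mid i = (m,s) \in S',\ j = (n,s') \in S' \,\} \\
Q' &: \{\, q_i(a,s') = \begin{cases} 1 & \text{if } s = s' \\ 0 & \text{otherwise} \end{cases} \mid i = (m,s) \in S',\ s' \in Z' \,\} \\
R' &: \{\, r_i(a) = r_m(s,a) \mid i = (m,s) \in S' \,\} \\
\Pi' &= \{\, \pi'_i = \pi_m \cdot \psi_s \mid i = (m,s) \in S' \,\}
\end{aligned}
$$

Fig. 4. Reformulating HM-MDP into POMDP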



Note that $S', A', T', R', Z', Q', \Pi'$ are the state space, action space, state
transition function, reward function, observation space, observation function,
and prior state probabilities of the resulting POMDP respectively. The following
lemmas prove that HM-MDPs are indeed a subclass of POMDPs.

Lemma 1. The HM-MDP to POMDP transformation satisfies the state transition
function requirement, namely $\sum_j p_{ij}(a) = 1$.

Proof: From the transformation, we know that

$$
\begin{aligned}
\sum_j p_{ij}(a) &= \sum_{(n,s') \in Q \times S} p_{(m,s),(n,s')}(a) \\
&= \sum_{n \in Q} \sum_{s' \in S} x_{mn} \cdot y_m(s,a,s') \\
&= \sum_{n \in Q} x_{mn} \cdot \sum_{s' \in S} y_m(s,a,s') \\
&= \sum_{n \in Q} x_{mn} \cdot 1 \\
&= 1 \qquad \square
\end{aligned}
$$



Lemma 2. HM-MDPs form a subclass of POMDPs. 

Proof: Given an HM-MDP and its corresponding POMDP, there is a 1-1 mapping
from the mode-state pairs of the HM-MDP into the states of the POMDP.
This implies that for every two different pairs of mode and state $(m, s)$ and
$(m', s')$, there are corresponding distinct states in the POMDP. The transformation
ensures that the two states are the same if and only if $m = m'$ and $s = s'$.
It also imposes the same transition probability and average reward between two
mode-state pairs and between the corresponding states in the POMDP. It follows that
the transformed model is equivalent to the original one. Therefore HM-MDPs
are a subclass of POMDPs. □

Figure 5 demonstrates the reformulation for a 3-mode, 4-state, 1-action HM-MDP
and compares the model complexity with its equivalent POMDP. Note that
most state transition probabilities (dashed lines) in the POMDP are collapsed
into mode transition probabilities in the HM-MDP through parameter sharing.
This saving is significant. In the example, the HM-MDP model has a total of 57
transition probabilities, while its POMDP counterpart has 144. In general, an
HM-MDP contains far fewer parameters ($N^2MK + M^2$) than its corresponding
POMDP ($M^2N^2K$).
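The two parameter counts can be checked with a short calculation. The function names below are made up for this sketch; the formulas are the ones quoted above, and the first call reproduces the 57 versus 144 figures for the 3-mode, 4-state, 1-action example.

```python
def hmmdp_param_count(M, N, K):
    """Transition parameters of an M-mode, N-state, K-action HM-MDP: N^2*M*K + M^2."""
    return N * N * M * K + M * M


def pomdp_param_count(M, N, K):
    """Transition parameters of the equivalent POMDP with M*N hidden states: M^2*N^2*K."""
    return (M * N) ** 2 * K


print(hmmdp_param_count(3, 4, 1), pomdp_param_count(3, 4, 1))    # 57 144
print(hmmdp_param_count(3, 10, 5), pomdp_param_count(3, 10, 5))  # 1509 4500
```

The second line applies the same formulas to a 3-mode, 10-state, 5-action model, where the gap is roughly a factor of three.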

3 Learning a Hidden-Mode Model 

Now it becomes clear that there are two ways to learn a hidden-mode model:
either learning an HM-MDP directly, or learning an equivalent POMDP. In this
section, we first briefly discuss how the latter can be achieved by a variant of
the Baum-Welch algorithm, and then develop a similar algorithm for HM-MDP.



3.1 Learning a POMDP Model 

Traditional research in POMDP (Monahan 20; Lovejoy 17) assumes a known 
environment model and is concerned with finding an optimal policy. Chrisman 
(8) was the first to study the learning of POMDP models from experience. 

Chrisman’s work is based on the Baum-Welch algorithm, which was originally
proposed for learning hidden Markov models (HMMs) (Rabiner 23). Based on
the fact that a POMDP can be viewed as a collection of HMMs, Chrisman
proposed a variant of the Baum-Welch algorithm for POMDP. This POMDP
Baum-Welch algorithm requires $O(M^2N^2T)$ time and $O(M^2N^2K)$ storage for
learning an $M$-mode, $N$-state, $K$-action HM-MDP, given $T$ data items.
However, Chrisman’s algorithm does not learn the reward function. One possible
extension is to estimate the reward function by averaging the obtained rewards,
weighted by the estimated state certainty. The effectiveness of the algorithm will
be examined in the next section.

Fig. 5. Reformulating a 3-mode, 4-state, 1-action HM-MDP described in Figure 2 as
an equivalent POMDP. For brevity, exact transition probability values are not shown.
State s of mode m in Figure 2 is relabeled as <m,s>. Note that HM-MDP has far
fewer model parameters than its POMDP counterpart.



3.2 HM-MDP Baum-Welch Algorithm

We now extend the Baum-Welch algorithm to the learning of an HM-MDP 
model. The outline of the algorithm, called HM-MDP Baum-Welch, is shown in 
Figure 6. This new algorithm is similar to the POMDP version, except that the 
auxiliary variables are redefined. 

The intuition behind both algorithms remains the same, i.e., estimating the
parameters of the hidden variables by counting their expected number of
transitions. This counting, however, can only be inferred from the observations by
maintaining a set of auxiliary variables. Suppose that the number of transitions
from the hidden variable $i$ to $j$ is known; the transition probability from $i$ to $j$
can then be computed by the following equation:

$$\Pr(j \mid i) = \frac{\text{the number of transitions from } i \text{ to } j}{\text{the number of visits to } i}$$

where the denominator is simply the numerator summed over all $j$.

Given a collection of data and an initial model parameter vector $\bar{\theta}$.
repeat
    $\theta = \bar{\theta}$
    Compute forward variables $\alpha_t$ (Figure 7)
    Compute backward variables $\beta_t$ (Figure 8)
    Compute auxiliary variables $\xi_t$ and $\gamma_t$ (Figure 9)
    Compute the new model parameter $\bar{\theta}$ (Figure 10)
until $\max_i |\bar{\theta}_i - \theta_i| < \epsilon$

Fig. 6. The skeleton of HM-MDP Baum-Welch algorithm

The central problem now becomes how to count the transitions of the hidden
variables. While the exact count is unknown, it is possible to estimate its expected
value through inference on the observation sequence. We define $\xi_t(i,j)$ as the
expected number of state transitions from $i$ to $j$ at time step $t$. Given a sequence
of $T$ observations, i.e., $T - 1$ transitions, the total number of transitions from
state $i$ to $j$ becomes $\sum_{t=1}^{T} \xi_t(i,j)$, and the total number of visits to $i$ is
$\sum_{j} \sum_{t=1}^{T} \xi_t(i,j)$.

Thus far, our algorithm is not different from the standard Baum-Welch
algorithm. The key difference between the two algorithms comes from the inference
part (i.e., maintaining the auxiliary variables). As the observed state sequence is
shifted one step forward in our model, additional attention is needed for
handling the boundary cases. In the subsequent sections, the intuitive meanings and
definitions of the auxiliary variables will be given.



Computing Forward Variables 

Let the collection of data $D$ be denoted as $(\mathbf{S}, \mathbf{A}, \mathbf{R})$, where $\mathbf{S}$ is the observation
sequence $s_1 s_2 \cdots s_T$, $\mathbf{A}$ the random^ action sequence $a_1 a_2 \cdots a_T$, and $\mathbf{R}$ the reward
sequence $r_1 r_2 \cdots r_T$. We define forward variables $\alpha_t(i)$ as
$\Pr(s_1, s_2, \ldots, s_t, m_{t-1} = i \mid \theta, \mathbf{A})$, i.e. the joint probability of observing the partial state sequence from $s_1$
up to $s_t$ and the mode at time $t - 1$ being $i$, given the model $\theta$ and the random
action sequence $\mathbf{A}$. To compute the forward variables efficiently, a dynamic
programming approach is depicted in Figure 7.

^ It is worth mentioning that although the optimal action is a function of modes and
states, the choice of action should be random in the model-learning phase, also
called the exploration phase.
Computing Forward Variables:

for all $i \in Q$,
    $\alpha_1(i) = \psi_{s_1}$
for all $i \in Q$,
    $\alpha_2(i) = \pi_i \, \psi_{s_1} \, y_i(s_1, a_1, s_2)$
for all $j \in Q$,
    $\alpha_{t+1}(j) = \sum_{i \in Q} \alpha_t(i) \, x_{ij} \, y_j(s_t, a_t, s_{t+1})$

Fig. 7. Computing forward variables
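The recursion in Figure 7 can be sketched directly in Python. The function name, the 0-based list layout (`alpha[t-1][i]` stores the variable for time step t) and the toy numbers are assumptions made for this example; the three update rules follow the figure.

```python
def forward(states, actions, x, y, pi, psi):
    """Return alpha, where alpha[t-1][i] is the forward variable for step t and mode i.

    states holds s_1..s_T and actions holds at least a_1..a_{T-1} (0-based lists).
    """
    M, T = len(x), len(states)
    alpha = [[psi[states[0]]] * M]                      # step 1: prior of the first state
    if T >= 2:
        s1, a1, s2 = states[0], actions[0], states[1]
        alpha.append([pi[i] * psi[s1] * y[i][s1][a1][s2] for i in range(M)])
    for t in range(2, T):                               # builds step t+1 from step t
        prev = alpha[-1]
        s, a, s_next = states[t - 1], actions[t - 1], states[t]
        alpha.append([
            sum(prev[i] * x[i][j] for i in range(M)) * y[j][s][a][s_next]
            for j in range(M)
        ])
    return alpha


# Toy 2-mode, 2-state, 1-action model (illustrative numbers only)
x = [[0.9, 0.1], [0.2, 0.8]]
y = [[[[0.7, 0.3]], [[0.4, 0.6]]],
     [[[0.1, 0.9]], [[0.5, 0.5]]]]
pi, psi = [0.6, 0.4], [0.5, 0.5]
alpha = forward(states=[0, 1, 1], actions=[0, 0], x=x, y=y, pi=pi, psi=psi)
likelihood = sum(alpha[-1])  # probability of the whole observed state sequence
```

Summing the last row over modes gives the likelihood of the observed sequence, which is the normalizer used when the forward and backward variables are combined.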



Computing Backward Variables 



Backward variable $\beta_t(i)$ denotes the probability
$\Pr(s_{t+1}, s_{t+2}, \ldots, s_T \mid s_t, m_{t-1} = i, \theta, \mathbf{A})$, which represents the probability of obtaining the partial state sequence
$s_{t+1}, s_{t+2}$ up to $s_T$, given the state $s_t$, the mode being $i$ at time $t - 1$, the
model $\theta$, and the random action sequence $\mathbf{A}$. Unlike forward variables, backward
variables are computed backward from $\beta_T$. The computational steps required
for the backward variables are illustrated in Figure 8.



Computing Backward Variables:

for all $i \in Q$,
    $\beta_T(i) = 1$
for all $i \in Q$,
    $\beta_t(i) = \sum_{j \in Q} x_{ij} \, y_j(s_t, a_t, s_{t+1}) \, \beta_{t+1}(j)$
for all $i \in Q$,
    $\beta_1(i) = \sum_{j \in Q} y_j(s_1, a_1, s_2) \, \beta_2(j)$

Fig. 8. Computing backward variables
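A minimal sketch of the backward pass follows. It implements only the general step and the terminal condition, for time steps 2 to T; the boundary case at step 1 is deliberately left out and should be taken from Figure 8. The function name, list layout and toy numbers are assumptions for illustration.

```python
def backward(states, actions, x, y):
    """Return beta, where beta[t-1][i] is the backward variable for step t and mode i.

    Only steps 2..T are filled in; beta[0] (step 1, the boundary case) is left as None.
    """
    M, T = len(x), len(states)
    beta = [None] * T
    beta[T - 1] = [1.0] * M                             # terminal condition at step T
    for t in range(T - 1, 1, -1):                       # t = T-1, ..., 2 (1-based)
        s, a, s_next = states[t - 1], actions[t - 1], states[t]
        beta[t - 1] = [
            sum(x[i][j] * y[j][s][a][s_next] * beta[t][j] for j in range(M))
            for i in range(M)
        ]
    return beta


# Toy 2-mode, 2-state, 1-action model (illustrative numbers only)
x = [[0.9, 0.1], [0.2, 0.8]]
y = [[[[0.7, 0.3]], [[0.4, 0.6]]],
     [[[0.1, 0.9]], [[0.5, 0.5]]]]
beta = backward(states=[0, 1, 1], actions=[0, 0], x=x, y=y)
```

Because each step only needs the row computed just before it, the pass costs on the order of M squared operations per time step.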




276 S.P.M. Choi, D.-Y. Yeung, and N.L. Zhang 



Computing Auxiliary Variables

With the forward and backward variables, two additional auxiliary variables
$\xi_t(i,j)$ and $\gamma_t(i)$ can now be defined. Formally, $\xi_t(i,j)$ is defined as
$\Pr(m_{t-1} = i, m_t = j \mid \mathbf{S}, \mathbf{A}, \theta)$, i.e. the probability of being in mode $i$ at time $t - 1$ and in mode $j$
at time $t$, given the state and action sequences, and the model $\theta$. Figure 9 depicts
how the forward and backward variables can be combined to compute $\xi_t(i,j)$.

Computing Auxiliary Variables:

For all $i, j \in Q$,
    $\xi_t(i,j) = \dfrac{\alpha_t(i) \, x_{ij} \, y_j(s_t, a_t, s_{t+1}) \, \beta_{t+1}(j)}{\sum_{k \in Q} \alpha_T(k)}$
For all $i \in Q$,
    $\gamma_t(i) = \sum_{j \in Q} \xi_{t+1}(i,j)$

Fig. 9. Computing auxiliary variables

Auxiliary variable $\gamma_t(i)$ is defined as $\Pr(m_t = i \mid \mathbf{S}, \mathbf{A}, \theta)$, i.e. the probability of
being in mode $i$ at time $t$, given the state and action sequences, and the model
$\theta$. Then $\gamma_t(i)$ is simply the sum of $\xi_{t+1}(i,j)$ over all possible resultant modes $j$;
that is,

$$\gamma_t(i) = \sum_{j \in Q} \xi_{t+1}(i,j)$$
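The combination step can be sketched as two small functions that take already-computed forward and backward values as inputs. The function names and the toy numbers (forward values for step 2, backward values for step 3, and the sequence likelihood for a three-step toy sequence) are assumptions made for this example.

```python
def xi_step(alpha_t, beta_next, x, y, s, a, s_next, likelihood):
    """Joint posterior over (previous mode i, current mode j) for one time step.

    alpha_t: forward values for step t; beta_next: backward values for step t+1;
    likelihood: sum over modes of the forward values at the final step T.
    """
    M = len(x)
    return [[alpha_t[i] * x[i][j] * y[j][s][a][s_next] * beta_next[j] / likelihood
             for j in range(M)] for i in range(M)]


def gamma_from_xi(xi_next):
    """Marginal posterior of mode i, obtained by summing over resultant modes j."""
    return [sum(row) for row in xi_next]


# Toy 2-mode, 2-state, 1-action model (illustrative numbers only)
x = [[0.9, 0.1], [0.2, 0.8]]
y = [[[[0.7, 0.3]], [[0.4, 0.6]]],
     [[[0.1, 0.9]], [[0.5, 0.5]]]]
alpha_2 = [0.09, 0.18]   # assumed forward values at step 2
beta_3 = [1.0, 1.0]      # backward values at the final step
likelihood = 0.1467      # assumed likelihood of the toy sequence
xi_2 = xi_step(alpha_2, beta_3, x, y, s=1, a=0, s_next=1, likelihood=likelihood)
gamma_1 = gamma_from_xi(xi_2)
```

With consistent inputs the entries of `xi_2` sum to one, which is a convenient sanity check when implementing the full algorithm.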
Parameter Reestimation

With these auxiliary variables, the model parameters can be updated. In
particular, the prior mode distribution $\pi_i$, i.e. the probability of being in mode $i$ at
time $t = 1$, is simply the auxiliary variable $\gamma_1(i)$.

The mode transition probability $x_{ij}$ is the number of mode transitions from
$i$ to $j$ divided by the number of visits to mode $i$. Although these two values
cannot be known exactly, they can be estimated by summing up the variables $\xi$
and $\gamma$ respectively:

$$\bar{x}_{ij} = \frac{\text{expected number of mode transitions from } i \text{ to } j}{\text{expected number of visits to mode } i} = \frac{\sum_{t=2}^{T} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$$

Similarly, the state transition probabilities and reward function can be
reestimated by the following:

$$\bar{y}_i(j,k,l) = \frac{\text{expected number of visits to mode } i \text{ state } j \text{, with action } k \text{ and resulting in } l}{\text{expected number of visits to mode } i \text{ state } j \text{, with action } k} = \frac{\sum_{t=1}^{T-1} \gamma_t(i) \, \delta(s_t,j) \, \delta(a_t,k) \, \delta(s_{t+1},l)}{\sum_{h \in S} \sum_{t=1}^{T-1} \gamma_t(i) \, \delta(s_t,j) \, \delta(a_t,k) \, \delta(s_{t+1},h)}$$

$$\bar{r}_i(j,k) = \frac{\text{expected reward received by taking action } k \text{ at mode } i \text{ state } j}{\text{expected number of visits to mode } i \text{ state } j \text{ with action } k} = \frac{\sum_{t=1}^{T-1} \gamma_t(i) \, \delta(a_t,k) \, \delta(s_t,j) \, r_t}{\sum_{t=1}^{T-1} \gamma_t(i) \, \delta(a_t,k) \, \delta(s_t,j)}$$

The function $\delta(a, b)$ is defined as 1 if $a = b$, or 0 otherwise. It is used for
selecting the particular state and action. For a small data set, it is possible that
some states or actions do not occur at all. In that case, the denominators of
the formulae are equal to zero, and the parameters should remain unchanged.
Figure 10 summarizes the parameter reestimation procedure.
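The reestimation of the state transition probabilities and mean rewards can be sketched as weighted counting, with the zero-denominator rule handled explicitly. The function name, argument layout (`gamma[t][i]` as the mode posterior for the t-th transition) and toy data are assumptions for this example; it is an illustration of the counting idea rather than the authors' implementation.

```python
def reestimate_y_r(gamma, states, actions, rewards, y_old, r_old):
    """Reestimate y[i][j][k][l] and r[i][j][k] from mode posteriors gamma[t][i].

    Entries whose denominator is zero (state-action pair never visited) keep
    their old values.
    """
    M, N, K = len(y_old), len(y_old[0]), len(y_old[0][0])
    y_new = [[[row[:] for row in y_old[i][j]] for j in range(N)] for i in range(M)]
    r_new = [[r_old[i][j][:] for j in range(N)] for i in range(M)]
    for i in range(M):
        visits = [[0.0] * K for _ in range(N)]
        trans = [[[0.0] * N for _ in range(K)] for _ in range(N)]
        rew = [[0.0] * K for _ in range(N)]
        for t in range(len(states) - 1):
            j, k, l = states[t], actions[t], states[t + 1]
            visits[j][k] += gamma[t][i]              # expected visits to (i, j, k)
            trans[j][k][l] += gamma[t][i]            # ... that ended in state l
            rew[j][k] += gamma[t][i] * rewards[t]    # ... weighted by the reward
        for j in range(N):
            for k in range(K):
                if visits[j][k] > 0.0:
                    y_new[i][j][k] = [c / visits[j][k] for c in trans[j][k]]
                    r_new[i][j][k] = rew[j][k] / visits[j][k]
    return y_new, r_new


# Toy data: one mode, three states, one action; state 2 is never visited.
third = 1.0 / 3.0
y_old = [[[[third, third, third]] for _ in range(3)]]
r_old = [[[0.0], [0.0], [0.0]]]
gamma = [[1.0], [1.0], [1.0]]
y_new, r_new = reestimate_y_r(gamma, [0, 1, 0, 0], [0, 0, 0], [1.0, 2.0, 3.0], y_old, r_old)
```

In the toy run the row for the unvisited state keeps its initial uniform values, while the visited rows become simple empirical frequencies because there is only one mode.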
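$$
\bar{x}_{ij} =
\begin{cases}
\dfrac{\sum_{t=2}^{T} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)} & \text{if denominator} \neq 0 \\[2ex]
x_{ij} & \text{otherwise}
\end{cases}
$$

$$
\bar{y}_i(j,k,l) =
\begin{cases}
\dfrac{\sum_{t=1}^{T-1} \gamma_t(i) \, \delta(s_t,j) \, \delta(a_t,k) \, \delta(s_{t+1},l)}{\sum_{h \in S} \sum_{t=1}^{T-1} \gamma_t(i) \, \delta(s_t,j) \, \delta(a_t,k) \, \delta(s_{t+1},h)} & \text{if denominator} \neq 0 \\[2ex]
y_i(j,k,l) & \text{otherwise}
\end{cases}
$$

$$
\bar{r}_i(j,k) =
\begin{cases}
\dfrac{\sum_{t=1}^{T-1} \gamma_t(i) \, \delta(a_t,k) \, \delta(s_t,j) \, r_t}{\sum_{t=1}^{T-1} \gamma_t(i) \, \delta(a_t,k) \, \delta(s_t,j)} & \text{if denominator} \neq 0 \\[2ex]
r_i(j,k) & \text{otherwise}
\end{cases}
$$

$$\bar{\pi}_i = \gamma_1(i)$$

Fig. 10. Parameter reestimation for HM-MDP Baum-Welch algorithm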



It is not difficult to see that the HM-MDP Baum-Welch algorithm requires
only $O(M^2T)$ time and $O(MN^2K + M^2)$ storage, which gives a significant
reduction when compared with $O(M^2N^2T)$ time and $O(M^2N^2K)$ storage in the
POMDP approach.

4 Empirical Studies 

This section empirically examines the POMDP Baum-Welch^ and HM-MDP 
Baum-Welch algorithms in terms of the required data size and time. Experi- 
ments on various model sizes and settings were conducted. The results are quite 
consistent. In the following, some details of a typical run are presented for illu- 
stration. 

4.1 Experimental Setting 

The experimental model is a randomly generated HM-MDP with 3 modes, 10
states and 5 actions. Note that this HM-MDP is equivalent to a fairly large
POMDP, with 30 hidden states, 10 observations and 5 actions. In order to
simulate the infrequent mode changes, each mode is set to have a minimum
probability of 0.9 in looping back to itself. In addition, each state of the HM-MDP has
3 to 8 non-zero transition probabilities, and rewards are uniformly distributed
between $r_m(s, a) \pm 0.1$. This reward distribution, however, is not disclosed to the
learning agents.

^ Chrisman’s algorithm also attempts to learn a minimal possible number of states.
Here we are concerned only with learning of the model parameters.

The experiments were run with data of various sizes, using the same initial
model. The model was also randomly generated in the form of HM-MDP. To
ensure fairness, the equivalent POMDP model was used for POMDP Baum-Welch
learning. For each data set, the initial model was first loaded, and the
selected algorithm iterated until the maximum change of the model parameters
was less than a threshold of 0.0001. After the algorithm terminated, the model
learning time was recorded, and the model estimation errors were computed.
The experiment was then repeated 11 times with different random seeds in
order to compute the median.
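Two ingredients of this setup are easy to sketch: a random mode transition matrix whose self-loop probability is at least 0.9, and the stopping rule based on the maximum parameter change. The function names and the way the off-diagonal mass is spread are assumptions made for this example; the source does not describe how its random models were generated.

```python
import random


def random_mode_transitions(M, min_self=0.9, rng=None):
    """Random M x M mode transition matrix with self-loop probability >= min_self."""
    rng = rng or random.Random()
    x = []
    for m in range(M):
        self_p = rng.uniform(min_self, 1.0)
        weights = [rng.random() for _ in range(M)]
        weights[m] = 0.0
        total = sum(weights)
        if total == 0.0:
            # single mode (or degenerate draw): all mass stays on the self-loop
            row = [0.0] * M
            self_p = 1.0
        else:
            row = [(1.0 - self_p) * w / total for w in weights]
        row[m] = self_p
        x.append(row)
    return x


def converged(old_params, new_params, threshold=1e-4):
    """Stopping rule: the largest absolute parameter change is below the threshold."""
    return max(abs(a - b) for a, b in zip(old_params, new_params)) < threshold


x_modes = random_mode_transitions(3, rng=random.Random(1))
```

The `converged` helper expects the model parameters flattened into two equally long sequences, which is one simple way to apply the 0.0001 threshold described above.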



4.2 Performance Measure 

The HM-MDP and POMDP Baum-Welch algorithms learn a hidden-mode model
in different representations. To facilitate comparison, all models were first
converted into POMDP form. Model estimation errors can then be measured
in terms of the minimum difference between the learned model and the actual 
model. As the state indices for the learned model might be different from the 
actual one, a renumbering of the state indices is needed. In our experiment, an 
indexing scheme that minimizes the sum of the squares of differences on the 
state transition probabilities between the learned and the actual models was 
used (provided the constraints on the observation probabilities are preserved). 
Figure 11 (a) and (b) report respectively the sum of the squares of differences on 
the transition function and on the reward function using this indexing scheme. 
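The relabeling step can be illustrated with a brute-force search over permutations. This sketch is a simplification: it aligns only the mode labels and compares only mode transition matrices, whereas the measure described above works on the full POMDP state transition probabilities. The function name and toy matrices are assumptions for this example.

```python
from itertools import permutations


def best_mode_relabeling(x_true, x_learned):
    """Return (error, perm) minimizing the sum of squared differences.

    perm[m] is the learned mode index matched to true mode m. Exhaustive
    search is only practical for a small number of modes.
    """
    M = len(x_true)
    best = None
    for perm in permutations(range(M)):
        err = sum((x_true[m][n] - x_learned[perm[m]][perm[n]]) ** 2
                  for m in range(M) for n in range(M))
        if best is None or err < best[0]:
            best = (err, perm)
    return best


x_true = [[0.9, 0.1], [0.2, 0.8]]
x_learned = [[0.8, 0.2], [0.1, 0.9]]  # same chain with the two mode labels swapped
error, perm = best_mode_relabeling(x_true, x_learned)
```

Without such an alignment a perfectly learned model could be scored as poor simply because its hidden indices are numbered differently.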

Regarding the computational requirement of the algorithms, the total CPU 
running time was measured on a SUN Ultra I workstation. Table 1 reports the 
model learning time in seconds. 



Table 1. CPU time in seconds

            Data Set Size
Approach    1000      2000      3000       4000       5000
HM-MDP      4.60      18.72     15.14      9.48       10.07
POMDP       189.40    946.78    2164.20    3233.56    4317.19
[Figure 11: two plots, each titled "Comparing POMDP and HM-MDP Baum-Welch" with
x-axis "Window Size": (a) Error in Transition Function; (b) Error in Reward Function.]

Fig. 11. Model learning errors in terms of the transition probabilities and rewards.
All environment models are in their POMDP form for comparison. The errors are
measured by summing the squares of differences on the state transition probabilities
and on the reward function respectively.
4.3 Empirical Results

Conclusions can now be drawn. Generally speaking, both algorithms can learn a
more accurate environment model as the data size increases (Figure 11). This
result is not surprising since both algorithms are statistically based, and hence
their performance relies largely on the amount of data provided. When the
training data size is too small (less than 1000 in this case), both algorithms perform
about equally poorly. However, as the data size increases, HM-MDP Baum-Welch
improves substantially faster than POMDP Baum-Welch.

Our experiment reveals that HM-MDP Baum-Welch was able to learn a fairly
accurate environment model with a data size of 2500. POMDP Baum-Welch,
on the contrary, needs a data size of 20000 (not shown) in order to achieve a
comparable accuracy. In fact, in all the experiments we conducted, HM-MDP
Baum-Welch always required a much smaller data set than POMDP Baum-Welch.
We believe that this result holds in general because in most cases, an
HM-MDP consists of fewer free parameters than its POMDP counterpart.

In terms of computational requirement, HM-MDP Baum-Welch is much
faster than POMDP Baum-Welch (Table 1). We believe this is also true in general
for the same reason described above. In addition, computational time is not
necessarily monotonically increasing with the data size. This is because the total
computational requirements depend not only on the data size, but also on the
number of iterations being executed. From our experiments, we notice that the
number of iterations tends to decrease as the data size increases.

5 Solving Hidden-Mode Problems 

In this section we describe briefly how hidden-mode problems can be solved.
Since HM-MDPs are a special class of POMDPs, a straightforward approach is
first to convert a hidden-mode problem into a POMDP one and subsequently to
apply POMDP algorithms. Nevertheless, POMDP algorithms do not exploit the
special structures of HM-MDPs. Herein, we develop a value iteration approach
for HM-MDPs^. The main idea of the algorithm is to exploit the HM-MDP
structure by decomposing the value function based on the state space. This
algorithm is conceptually akin to POMDP value iteration but is much more
efficient.



5.1 Value Iteration for HM-MDPs 

Many POMDP algorithms maintain a probability distribution over the state
space known as belief state. As a result, a policy is a mapping from belief
states to actions. In HM-MDPs, a probability distribution over the mode space is
maintained. We name it belief mode. Unlike the POMDP approach, the belief
mode divides the belief states into an observable part (i.e. the observable states)
and an unobservable part (i.e. the hidden modes). This representation minimizes
the number of hidden variables. For every observed state and action, the belief
mode $b$ is updated as follows.

^ A more detailed description of the algorithm can be found in (Choi 6).

$$
\begin{aligned}
b^{a}_{s'}(m') &= \frac{\sum_{m \in Q} \Pr(m' \mid m) \cdot \Pr(s' \mid m, s, a) \cdot b(m)}{\sum_{m' \in Q} \sum_{m \in Q} \Pr(m' \mid m) \cdot \Pr(s' \mid m, s, a) \cdot b(m)} \\
&= \frac{\sum_{m \in Q} x_{mm'} \cdot y_m(s,a,s') \cdot b(m)}{\sum_{m' \in Q} \sum_{m \in Q} x_{mm'} \cdot y_m(s,a,s') \cdot b(m)}
\end{aligned}
$$

where $b^{a}_{s'}(m')$ is the probability of being in mode $m'$, given the action $a$ and
the state $s'$. The numerator computes the likelihood of the next mode and the
denominator is the normalization factor.
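The belief mode update translates into a few lines of Python. The function name, the list-based model layout (`x[m][m2]`, `y[m][s][a][s2]`) and the toy numbers are assumptions made for this sketch.

```python
def update_belief_mode(b, s, a, s_next, x, y):
    """Posterior over the next mode after observing the transition s -> s_next under action a."""
    M = len(b)
    unnorm = [sum(x[m][m2] * y[m][s][a][s_next] * b[m] for m in range(M))
              for m2 in range(M)]
    total = sum(unnorm)  # normalization factor
    if total == 0.0:
        raise ValueError("observed transition has zero probability under the model")
    return [u / total for u in unnorm]


# Toy 2-mode, 2-state, 1-action model (illustrative numbers only)
x = [[0.9, 0.1], [0.2, 0.8]]
y = [[[[0.7, 0.3]], [[0.4, 0.6]]],
     [[[0.1, 0.9]], [[0.5, 0.5]]]]
b_next = update_belief_mode([0.5, 0.5], s=0, a=0, s_next=1, x=x, y=y)
print(b_next)  # approximately [0.375, 0.625]
```

Here the transition from state 0 to state 1 is three times more likely under the second mode, so the belief shifts toward it even though the two modes started out equally likely.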

An HM-MDP policy now becomes a mapping from belief modes and states
to actions. Specifically, an optimal action for a belief mode $b$ and the current
state $s$ maximizes the following value function:

$$V(b,s) = \max_{a \in A} \Big( \sum_{m \in Q} r_m(s,a) \cdot b(m) + \gamma \sum_{s' \in S} \Pr(s' \mid b, s, a) \, V(b^{a}_{s'}, s') \Big) \qquad (1)$$

where $r_m(s,a)$ is the reward function and $\gamma$ is the discount factor. Since $s$ is a
discrete variable, one can view the value function $V$ as $|S|$ value functions. Note
that these decomposed value functions are still piecewise linear and convex. One
can therefore use Equation (1) for the value iteration to compute the optimal
value function for each state in $S$.
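To make Equation (1) concrete, the sketch below evaluates it by exhaustive lookahead to a fixed depth, starting from a zero value function. This is a plain recursion chosen for readability and is not the pruning-based value iteration used in the experiments; the function name, model layout and toy numbers are assumptions for illustration.

```python
def value(b, s, depth, x, y, r, discount):
    """Finite-horizon evaluation of Equation (1) by exhaustive lookahead."""
    if depth == 0:
        return 0.0
    M, N, K = len(x), len(y[0]), len(y[0][0])
    best = float("-inf")
    for a in range(K):
        q = sum(r[m][s][a] * b[m] for m in range(M))   # expected immediate reward
        for s2 in range(N):
            unnorm = [sum(x[m][m2] * y[m][s][a][s2] * b[m] for m in range(M))
                      for m2 in range(M)]
            p_s2 = sum(unnorm)                         # probability of observing s2 next
            if p_s2 > 0.0:
                b2 = [u / p_s2 for u in unnorm]        # updated belief mode
                q += discount * p_s2 * value(b2, s2, depth - 1, x, y, r, discount)
        best = max(best, q)
    return best


# Toy 2-mode, 2-state, 1-action model (illustrative numbers only)
x = [[0.9, 0.1], [0.2, 0.8]]
y = [[[[0.7, 0.3]], [[0.4, 0.6]]],
     [[[0.1, 0.9]], [[0.5, 0.5]]]]
r = [[[1.0], [0.0]], [[0.0], [2.0]]]
v1 = value([0.5, 0.5], 0, 1, x, y, r, 0.95)
v2 = value([0.5, 0.5], 0, 2, x, y, r, 0.95)
```

The cost of this recursion grows exponentially with the depth, which is why practical solvers represent each per-state value function as a set of vectors instead.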



5.2 Empirical Results 

We implement Equation (1) by using a variant of incremental pruning (Zhang
and Liu 31; Cassandra et al. 4) and run the program on a SPARC Ultra 2
machine. A number of experiments were conducted based on some simplified
real-world problems. The descriptions of the problems can be found in (Choi 6).
Table 2 gives a summary of the tasks. In these experiments, a discount factor of
0.95 and a solution tolerance of 0.1 were used. Table 3 reports the CPU time in
seconds, the number of vectors in the resulting value functions, and the number
of required epochs. While incremental pruning is considered one of the most
efficient POMDP algorithms, our experiment showed that it is unable to solve
any of these problems within the specified time limit.

There are two main reasons why the HM-MDP approach is more efficient
than the POMDP one. First, the dimension of the vector is significantly reduced
from $|Q| \cdot |S|$ to $|Q|$. Second, the most time-consuming part of the algorithm,
namely the cross sum operation, no longer needs to be performed on the whole
value function due to decomposition of the value function.
Table 2. HM-MDP problems

Problem          Modes    States    Actions
Traffic Light    2        8         2
Sailboat         4        16        2
Elevator         3        32        3


Table 3. Solving HM-MDP problems by using incremental pruning

                 POMDP Approach                  HM-MDP Approach
Problem          Time       Vectors   Epochs     Time      Vectors   Epochs
Traffic Light    >259200    -         -          4380      404       114
Sailboat         >259200    -         -          170637    1371      112
Elevator         >259200    -         -          186905    3979      161


6 Discussions 

The usefulness of a model depends on the validity of the assumptions made. In 
this section, we revisit the assumptions of HM-MDP, discuss the issues involved, 
and shed some light on its applicability to real-world nonstationary tasks. Some 
possible extensions are also discussed. 



A Finite Number of Environment Modes 

MDP is a flexible framework that has been widely adopted in various
applications. Among these there exist many tasks that are nonstationary in nature and
are better characterized by several MDPs than by a single one. The
introduction of distinct MDPs for modeling different modes of the environment
is a natural extension to those tasks.

One advantage of having distinct MDPs is that the learned model is more 
comprehensible: each MDP naturally describes a mode of the environment. In 
addition, this formulation facilitates the incorporation of prior knowledge into 
the model initialization step. 



Partially Observable Modes 

While modes are not directly observable, they may be estimated by observing the 
past state transitions. It is a crucial, and fortunately still reasonable, assumption 
that one needs to make. 

Although states are assumed to be observable, it is possible to extend the
model to allow partially observable states, i.e., to relax the second condition
mentioned in Section 1. In this case, the extended model would be equivalent
in representational power to a POMDP. This could be proved by showing the
reformulation of the two models in both directions.



Modes Evolving as a Markov Process 

This property may not hold for all real-world tasks. In some
applications, such as learning in a multi-agent environment or performing tasks in an
adversarial environment, the agent’s actions might affect the state as well as the
environment mode. In that case, an MDP instead of a Markov chain should be
used to govern the mode transition process. Obviously, the use of a Markov chain
involves fewer parameters and is thus preferable whenever possible.



Infrequent Mode Transitions 

This is a property that generally holds in many applications. In order to cha- 
racterize this property, a large transition probability for a mode looping back 
to itself can be used. Note that this is introduced primarily from a practical 
point of view, but is not a necessary condition for our model. In fact, we have 
tried to apply our model-learning algorithms to problems in which this property 
does not hold. We find that our model still outperforms POMDP, although the 
required data size is typically increased for both cases. 

Using high self-transition probabilities to model rare mode changes may not
always be the best option. In some cases mode transitions are also correlated
with the time of day (e.g. busy traffic in lunch hours). In this case, time (or the
mode sequence) should be taken into account for identifying the current mode.
One simple way to model this property is to strengthen left-to-right transitions
between modes, as in the left-to-right HMMs.
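One hypothetical way to encode such a structure is an initial mode transition matrix in which each mode either persists or advances to the next mode in a fixed order. The function name, the `stay` parameter and the wrap-around from the last mode back to the first (so that a daily cycle can repeat) are assumptions made for this sketch; classical left-to-right HMMs do not wrap around.

```python
def left_to_right_transitions(M, stay=0.9):
    """Mode transition matrix where mode m either persists or advances to mode m+1.

    The last mode wraps around to the first one (an assumption of this sketch).
    """
    x = [[0.0] * M for _ in range(M)]
    for m in range(M):
        if M == 1:
            x[m][m] = 1.0
        else:
            x[m][m] = stay
            x[m][(m + 1) % M] = 1.0 - stay
    return x


x_cycle = left_to_right_transitions(3)
```

Entries initialized to zero stay at zero under the counting-based reestimation of the mode transition probabilities, so a matrix like this constrains the learned mode dynamics rather than merely biasing them.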



Small Number of Modes 

This nice property significantly reduces the number of parameters in HM-MDP 
compared to that in POMDP, and makes the former more applicable to real- 
world nonstationary tasks. 

The number of states can be determined by the learning agent. States can be 
distinguished by, for instance, transition probabilities, mean rewards, or utilities. 
McCallum (19) has detailed discussions on this issue. 

Likewise, the number of modes can be defined in various ways. After all, mo- 
des are used to discern changes of environment dynamics from noise. In practice, 
introducing a few modes is sufficient for boosting the system performance. More 
modes might help further, but not necessarily significantly. A trade-off between 
performance and response time must thus be decided. In fact, determining the 
optimal number of modes is an important topic that deserves further studies. 
7 Future Work

There are a number of issues that need to be addressed in order to broaden 
the applicability of HM-MDPs. First, the number of modes is currently assumed 
to be known. In some situations, choosing the right number of modes can be 
difficult. Hence, we are now investigating the possibility of using Chrisman’s or 
McCallum’s hidden-state-splitting techniques (Chrisman 8; McCallum 18) to re- 
move this limitation. Next, the problem-solving algorithm we presented here is 
preliminary. Further improvement, such as incorporating the point-based impro- 
vement technique (Zhang et al. 30), can be achieved. We are also investigating 
an algorithm that further exploits the characteristics of the HM-MDP. We will 
present this algorithm in a separate paper. Finally, the exploration-exploitation 
issue is currently ignored. In our future work, we will address this important 
issue and apply our model to real-world nonstationary tasks. 



8 Summary 

Making sequential decisions in nonstationary environments is common in real- 
world problems. In this paper we presented a formal model, called hidden-mode 
Markov decision process, for a broad class of nonstationary sequential decision 
tasks. The proposed model is based on five properties observed in a special 
type of nonstationary environments, and is applicable to many traffic control 
type problems. Basically, the hidden-mode model is defined as a fixed number 
of partially observable modes, each of which specifies an MDP. While state and 
action spaces are fixed across modes, the transition and reward functions may 
differ according to the mode. In addition, the mode evolves according to a Markov 
chain. 

HM-MDP is a generalization of MDP. In addition to the basic MDP charac- 
teristics, HM-MDP also allows the model parameters to change probabilistically. 
This feature is important because many real-world tasks are nonstationary in 
nature and cannot be represented accurately by a fixed model. Nevertheless, the 
hidden-mode model also adds uncertainty to the model parameters and makes 
the problem, in general, more difficult than the MDPs. 

HM-MDP is a specialization of POMDP; it can always be transformed into a 
POMDP with an augmented state space. While POMDPs are superior in terms 
of representational power, HM-MDPs require fewer parameters, and therefore 
can provide a more natural formulation for certain types of nonstationary
problems. Our experiments also show that this simplification significantly speeds
up both the model-learning and the problem-solving procedures of HM-MDPs. 
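
As a rough illustration of the parameter savings, the sketch below counts transition probabilities only, assuming dense tables and assuming that the augmented POMDP state is a (mode, state) pair. The function names and the example sizes are hypothetical and are not taken from the experiments.

```python
def hm_mdp_param_count(num_modes: int, num_states: int, num_actions: int) -> int:
    """Transition entries of a hidden-mode MDP stored as dense tables."""
    mode_chain = num_modes * num_modes                              # Markov chain over modes
    per_mode_mdps = num_modes * num_states * num_actions * num_states  # one MDP per mode
    return mode_chain + per_mode_mdps


def flat_pomdp_param_count(num_modes: int, num_states: int, num_actions: int) -> int:
    """Transition entries of a generic POMDP over the augmented (mode, state) space."""
    augmented = num_modes * num_states
    return augmented * num_actions * augmented


# Hypothetical sizes: 3 modes, 8 states, 2 actions.
hm_count = hm_mdp_param_count(3, 8, 2)
flat_count = flat_pomdp_param_count(3, 8, 2)
print(hm_count, flat_count)
```

Under these assumptions the gap widens quickly as the number of modes grows, because the flat table is quadratic in the number of modes while the per-mode tables are linear in it.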
1 Introduction

Reinforcement Learning (RL) procedures have been established as powerful and 
practical methods for solving Markov Decision Problems. One of the most sig- 
nificant and actively investigated RL algorithms is Q-learning (Watkins, 1989). 
Q-learning is an algorithm for learning to estimate the long-term expected reward
for a given state-action pair. It has the nice property that it does not need a 
model of the environment, and it can be used for on-line learning. A number 
of powerful convergence proofs have been given showing that Q-learning is gu- 
aranteed to converge, in cases where the state space is small enough so that 
lookup table representations can be used (Watkins and Dayan, 1992). Further- 
more, in large state spaces where lookup table representations are infeasible, RL 
methods can be combined with function approximators to give good practical 
performance despite the lack of theoretical guarantees of convergence to optimal 
policies. 

Most real-world problems are not fully Markov in nature - they are often 
non-stationary, history-dependent and/or not fully observable. In order for RL 
methods to be more generally useful in solving such problems, they need to be 
extended to handle these non-Markovian properties. One important application
domain where the non-Markovian aspects are paramount is the area of multi- 
agent systems. This area is expected to be increasingly important in the future, 
due to the potential rapid emergence of “agent economies” consisting of large 
populations of interacting software agents engaged in various forms of economic 
activity. The problem of multiple agents simultaneously adapting is in general 
non-Markov, because each agent provides an effectively non-stationary environ- 
ment for the other agents. Hence the existing convergence guarantees do not 
hold, and in general, it is not known whether any global convergence will be 
obtained, and if so, whether such solutions are optimal. 

Some progress has been made in analyzing certain special case multi-agent 
problems. For example, the problem of “teams,” where all agents share a common 
utility function, has been studied by Crites and Barto (1996), and by Stone and 
Veloso (1999). Likewise, the purely competitive case of zero-sum utility functions 
has been studied in Littman (1994), where an algorithm called “minimax-Q” was 
proposed for two-player zero-sum games, and shown to converge to the optimal 
value function and policies for both players. Sandholm and Crites studied si- 
multaneous Q-learning by two players in the Iterated Prisoner’s Dilemma game 
(Sandholm and Crites, 1995), and found that the learning procedure generally
converged to stationary solutions. However, the extent to which those soluti- 
ons were “optimal” was unclear. Recently, Hu and Wellman (1998) proposed 
an algorithm for multi-agent Q-learning in two-player general-sum games. This 
algorithm is an important first step, but does not yet appear to be useable for 
practical problems, because it assumes that policies followed by both players will 
be Nash equilibrium policies. This requires unbounded rationality of all agents, 
as well as full knowledge of the state transitions and of the utility functions of all 
players. Furthermore, the potential problem of choosing from amongst multiple 
Nash equilibria is not addressed. This could be a serious problem, since Nash 
equilibria tend to proliferate when there is sufficiently high emphasis on future 
rewards, i.e., a large value of the discount parameter $\gamma$ (Kreps, 1990).

The present work examines simultaneous Q-learning in an economically mo- 
tivated two-player game. The players are assumed to be two sellers of similar or 
identical products, who compete against each other on the basis of price. At each 
time step, the sellers alternately take turns setting prices, taking into account 
the other seller’s current price. After the price has been set, the consumers then 
respond instantaneously and deterministically, choosing either seller 1’s product
or seller 2’s product (or no product) based on the current price pair $(p_1, p_2)$,
leading to an instantaneous reward or profit $(U_1, U_2)$ given to sellers 1 and 2
respectively. Although not required by Q-learning, it is assumed for purposes of 
comparing with game-theoretic solutions that both sellers have full knowledge 
of the expected consumer response for any given price pair, and in fact have full 
knowledge of both profit functions. 

This work builds on prior research reported in Tesauro and Kephart (2000), 
which examined the effect of including foresight, i.e. an ability to anticipate 
longer-term consequences of an agent’s current action. Two different algorithms 
for agent foresight were presented: (i) a generalization of the minimax search 
procedure in two-player zero-sum games; (ii) a generalization of the Policy Ite- 
ration method from dynamic programming, in which both players’ policies are 
simultaneously improved, until self-consistent policy pairs are obtained that op- 
timize expected reward over two time steps. It was found that including foresight 
in the agents’ pricing algorithms generally improved overall agent profitability, 
and usually damped out or eliminated the pathological behavior of unending 
cyclic “price wars,” in which long episodes of repeated undercutting amongst 
the sellers alternate with large jumps in price. Such price wars were found to be 
rampant in prior studies of agent economy models (Kephart, Hanson and Sai- 
ramesh, 1998; Sairamesh and Kephart, 1998) when the agents use “myopically 
optimal” or “myoptimal” pricing algorithms that optimize immediate reward, 
but do not anticipate the longer-term consequences of an agent’s current price 
setting. 

There are three primary motivations for studying simultaneous Q-learning. 
First, if Q-functions can be learned simultaneously and self-consistently for both 
players, the policies implied by those Q-functions should be self-consistently op- 
timal. (The meaning of “optimal” here and throughout this chapter is that the 
Q-function accurately represents discounted long-term reward obtained by
following the Q-derived policy, and that no other policy can obtain superior long-term
discounted reward.) In other words, an agent will be able to correctly anticipate
the longer-term consequences of its own actions and the other agents’ actions, and
will correctly model the other agents as having an equivalent capability. Hence 
the classic problem of infinite recursion of opponent models will be avoided. In 
contrast, in other approaches to adaptive multi-agent systems, these issues are 
more problematic. For example, Hu and Wellman (1996) study the situation 
of a single “strategic” agent, which is able to anticipate the market impact of 
its pricing actions, in a population of “reactive” agents, which have no such 
anticipatory capability. Likewise, Vidal and Durfee (1998) propose a recursive 
opponent modeling scheme, in which level-0 agents do no opponent modeling, 
level-1 agents model the opponents as being level-0, level-2 agents model the
opponents as being level-1, etc. In both of these approaches, there is no effective
way for an agent to model other agents as being at an equivalent level of depth 
or complexity. 

The second advantage of Q-learning is that the solutions should correspond 
to deep lookahead: in principle, the Q-function represents the expected reward 
looking infinitely far ahead in time, exponentially weighted by a discount parameter
$0 < \gamma < 1$. In contrast, the prior work of Tesauro and Kephart (2000) was
based on shallow finite lookahead. Finally, in comparison to directly modeling 
agent policies, the Q-function approach seems more extensible to the situation of 
very large economies with many competing sellers. Approximating Q-functions 
with nonlinear function approximators such as neural networks seems intuitively 
more feasible than approximating the corresponding policies. Furthermore, in the 
Q-function approach, each agent only needs to maintain a single Q-function for 
itself, whereas in the policy modeling approach, each agent needs to maintain a 
policy model for every other agent; the latter seems infeasible when the number 
of sellers is large. 

The remainder of this chapter is organized as follows. Section 2 describes 
the structure and dynamics of the model two-seller economy, and presents three 
economically-based models of seller profit (Price-Quality, Information-Filtering, 
and Shopbot) which are known to be prone to price wars when agents myopi- 
cally optimize their short-term payoffs. System parameters are chosen to place 
each of these systems in a price-war regime. Section 3 describes implementa- 
tion details of Q-learning in these model economies. As a first step, the simple 
case of ordinary Q-learning is considered, where one of the two sellers uses Q- 
learning and the other seller uses a fixed pricing policy (the myopically optimal, 
or “myoptimal” policy). Section 4 examines the more interesting and novel si- 
tuation of simultaneous Q-learning by both sellers. Section 5 studies single-agent 
Q-learning in these models using neural networks, and compares the results to 
those of section 3 using lookup tables. Finally, section 6 summarizes the main 
conclusions and discusses promising directions and challenges for future work. 
2 Model Agent Economies

Real agent economies are likely to contain large numbers of agents, with complex 
details of how the agents behave and interact with each other on multiple time 
scales. In order to make initial progress, a number of simplifying assumptions 
are made. The economy is restricted to two competing sellers, offering similar or 
identical products to a large population of consumer agents. The sellers compete 
on the basis of price, and it is assumed that prices are discretized and can 
lie between a minimum and maximum price, such that the number of possible 
prices is at most a few hundred. This renders the state space small enough that 
it is feasible to use lookup tables to represent the agents’ pricing policies and 
expected utilities. Time in the simulation is also discretized; at each time step, 
the consumers compare the current prices of the two sellers, and instantaneously 
and deterministically choose to purchase from at most one seller. Hence at each 
time step, for each possible pair of seller prices, there is a deterministic reward 
or profit given to each seller. The simulation can iterate forever, and there may 
or may not be a discounting factor for the present value of future rewards. 

It is worth noting that the consumers are not regarded as “players” in the 
model. The consumers have no strategic role: they behave according to an ex- 
tremely simple, fixed, short-term greedy rule (buy the lowest priced product at 
each time step), and are regarded as merely providing a stationary environment 
in which the two sellers can compete in a two-player game. This is clearly a sim- 
plifying first step in the study of multi-agent phenomena, and in future work, 
the models will be extended to include strategic and adaptive behavior on the 
part of the consumers as well. This will change the notion of “desirable” system 
behavior. In the present model, desirable behavior would resemble “collusion” 
between the two sellers in charging very high prices, so that both could obtain 
high profits. Obviously this is not desirable from the consumers’ viewpoint. 

Regarding the dynamics of seller price adjustments, it is assumed that the 
sellers alternately take turns adjusting their prices, rather than simultaneously 
setting prices (i.e. the game is extensive form rather than normal form). The 
choice of alternating-turn dynamics is motivated by two considerations: (a) As 
the number of sellers becomes large and the model becomes more realistic, it 
seems more reasonable to assume that the sellers will adjust their prices at 
different times rather than at the same time, although they probably will not 
take turns in a well-defined order. (b) With alternating-turn dynamics, one can
stay within the normal Q-learning framework where the Q-function implies a 
deterministic optimal policy: it is known that in two-player alternating turn 
games, there always exists a deterministic policy that is as good as any non- 
deterministic policy (Littman, 1994). In contrast, in games with simultaneous 
moves (for example, rock-paper-scissors), it is possible that no deterministic 
policy is optimal, and that the existing Q-learning formalism for MDPs would 
have to be modified and extended so that it could yield non-deterministic optimal 
policies. 
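
The rock-paper-scissors remark can be checked directly. The sketch below is only an illustration with hypothetical helper names: it computes the worst-case expected payoff of each deterministic policy and of the uniform random policy against a best-responding opponent.

```python
from fractions import Fraction

MOVES = ("rock", "paper", "scissors")
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}


def payoff(mine: str, theirs: str) -> int:
    """+1 for a win, -1 for a loss, 0 for a tie (zero-sum)."""
    if mine == theirs:
        return 0
    return 1 if BEATS[mine] == theirs else -1


def worst_case_value(policy: dict) -> Fraction:
    """Expected payoff of a (possibly mixed) policy against the opponent's best reply."""
    return min(
        sum(prob * payoff(mine, theirs) for mine, prob in policy.items())
        for theirs in MOVES
    )


# Each deterministic policy puts probability 1 on a single move.
deterministic = {m: worst_case_value({m: Fraction(1)}) for m in MOVES}
uniform = worst_case_value({m: Fraction(1, 3) for m in MOVES})
print(deterministic, uniform)
```

Every deterministic policy can be exploited for the full loss, whereas the uniform mixture cannot be exploited at all, which is why a formalism restricted to deterministic policies is insufficient for simultaneous-move games.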

Q-learning is studied in three different economic models that have been
described in detail elsewhere (Sairamesh and Kephart, 1998; Kephart, Hanson and
Sairamesh, 1998; Greenwald and Kephart, 1999). The first model, called the
“Price-Quality” model (Sairamesh and Kephart, 1998), models the sellers’ pro- 
ducts as being distinguished by different values of a scalar “quality” parameter, 
with higher-quality products being perceived as more valuable by the consumers. 
The consumers are modeled as trying to obtain the lowest-priced product at each 
time step, subject to threshold-type constraints on both quality and price, i.e., 
each consumer has a maximum allowable price and a minimum allowable qua- 
lity. The similarity and substitutability of seller products leads to a potential for 
direct price competition; however, the “vertical” differentiation due to differing
quality values leads to an asymmetry in the sellers’ profit functions. It is
believed that this asymmetry is responsible for the unending cyclic price wars 
that emerge when the sellers employ myoptimal pricing strategies. 

The second model is an “Information-Filtering” model described in detail in 
Kephart, Hanson and Sairamesh (1998). In this model there are two competing 
sellers of news articles in somewhat overlapping categories. In contrast to the 
vertical differentiation of the Price-Quality model, this model contains a hori- 
zontal differentiation in the differing article categories. To the extent that the 
categories overlap, there can be direct price competition, and to the extent that 
they differ, there are asymmetries introduced that again lead to the potential 
for cyclic price wars. 

The third model is the so-called “Shopbot” model described in Greenwald 
and Kephart (1999), which is intended to model the situation on the Internet in 
which some consumers may use a Shopbot to compare prices of all sellers offering 
a given product, and select the seller with the lowest price. In this model, the 
sellers’ products are exactly identical and the profit functions are symmetric. 
Myoptimal pricing leads the sellers to undercut each other until the minimum 
price point is reached. At that point, a new price war cycle can be launched, 
due to buyer asymmetries rather than seller asymmetries. The fact that not all 
buyers use the Shopbot, and some buyers instead choose a seller at random, 
means that it can be profitable for a seller to abandon the low-price competition 
for the bargain hunters, and instead maximally exploit the random buyers by 
charging the maximum possible price. 

An example economic profit function, taken from the Price-Quality model,
is as follows: Let $p_1$ and $p_2$ represent the prices charged by seller 1 and seller
2 respectively. Let $q_1$ and $q_2$ represent their respective quality parameters, with
$q_1 > q_2$. Let $c(q)$ represent the cost to a seller of producing an item of quality
$q$. Then the profits per item sold by sellers 1 and 2 are given by $p_1 - c(q_1)$ and
$p_2 - c(q_2)$ respectively. Now, according to the model of consumer behavior
described in Sairamesh and Kephart (1998), each consumer has a minimum quality
threshold and a maximum price threshold, and these threshold values are
uniformly distributed between 0 and 1 throughout the consumer population. This
implies that in the absence of competition, the market share obtained by seller
$i$ will be given by $(q_i - p_i)$; this represents the elimination of consumers with
quality thresholds greater than $q_i$ and price thresholds less than $p_i$. Combining
the market share calculation with the profit per item calculation, one can show
analytically that in the limit of infinitely many consumers, the instantaneous
utilities (profits per consumer) $U_1$ and $U_2$ obtained by seller 1 and seller 2
respectively are given by:

$$U_1 = \begin{cases} (q_1 - p_1)(p_1 - c(q_1)) & \text{if } 0 \le p_1 \le p_2 \text{ or } p_1 \ge q_2 \\ (q_1 - q_2)(p_1 - c(q_1)) & \text{if } p_2 < p_1 < q_2 \end{cases} \tag{1}$$

$$U_2 = \begin{cases} (q_2 - p_2)(p_2 - c(q_2)) & \text{if } 0 \le p_2 < p_1 \\ 0 & \text{if } p_2 \ge p_1 \end{cases} \tag{2}$$
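
The two profit functions translate directly into code. The following is a minimal sketch using the parameter settings quoted for figure 1 ($q_1 = 1.0$, $q_2 = 0.9$, $c(q) = 0.1(1 + q)$). The function names are hypothetical, and the treatment of the boundary $p_1 = q_2$ is an assumption; it is harmless because both branches of equation (1) give the same value there.

```python
def cost(q: float) -> float:
    """Production cost c(q) = 0.1 (1 + q), the setting quoted for figure 1."""
    return 0.1 * (1.0 + q)


def profit_seller1(p1: float, p2: float, q1: float = 1.0, q2: float = 0.9) -> float:
    """Equation (1): instantaneous profit per consumer for the higher-quality seller."""
    # Both branches agree at p1 == q2, so that boundary is assigned to the first case.
    if p1 <= p2 or p1 >= q2:
        return (q1 - p1) * (p1 - cost(q1))
    return (q1 - q2) * (p1 - cost(q1))


def profit_seller2(p1: float, p2: float, q2: float = 0.9) -> float:
    """Equation (2): seller 2 earns nothing unless it strictly undercuts seller 1."""
    if p2 < p1:
        return (q2 - p2) * (p2 - cost(q2))
    return 0.0


print(profit_seller1(0.9, 0.4), profit_seller2(0.9, 0.4))
```

Tabulating these two functions over a discretized price grid gives the kind of lookup table of rewards that the rest of the chapter relies on.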

A plot of the profit landscape for seller 1 as a function of prices $p_1$ and $p_2$
is given in figure 1, for the following parameter settings: $q_1 = 1.0$, $q_2 = 0.9$,
and $c(q) = 0.1(1 + q)$. (These specific parameter settings were chosen because
they are known to generate harmful price wars when the agents use myoptimal
policies.) In this figure, the myopic optimal price for seller 1 as a function of seller
2’s price, $p_1^*(p_2)$, is obtained for each value of $p_2$ by sweeping across all values
of $p_1$ and choosing the value that gives the highest profit. For small values of
$p_2$, the peak profit is obtained at $p_1 = 0.9$, whereas for larger values of $p_2$, there
is eventually a discontinuous shift to the other peak, which follows along the
parabolic-shaped ridge in the landscape. An analytic expression for the myopic
optimal price for seller 1 as a function of $p_2$ is as follows (defining $x_1 = q_1 + c(q_1)$
and $x_2 = q_2 + c(q_2)$):



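$$p_1^*(p_2) = \begin{cases} q_2 & \text{if } 0 \le p_2 < x_1 - q_2 \\ p_2 & \text{if } x_1 - q_2 \le p_2 \le \tfrac{1}{2} x_1 \\ \tfrac{1}{2} x_1 & \text{if } p_2 > \tfrac{1}{2} x_1 \end{cases} \tag{3}$$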

Similarly, the myopic optimal price for seller 2 as a function of the price set
by seller 1, $p_2^*(p_1)$, is given by the following formula (assuming that prices are
discrete and that $\epsilon$ is the price discretization interval):



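$$p_2^*(p_1) = \begin{cases} c(q_2) & \text{if } 0 \le p_1 \le c(q_2) \\ p_1 - \epsilon & \text{if } c(q_2) < p_1 \le \tfrac{1}{2} x_2 \\ \tfrac{1}{2} x_2 & \text{if } p_1 > \tfrac{1}{2} x_2 \end{cases} \tag{4}$$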
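
The sweep procedure and the analytic expression for seller 1 can be compared numerically. The sketch below is an illustration under stated assumptions: prices are integer multiples of an assumed interval of 0.01, stored in hundredths so that comparisons are exact; ties are broken toward the lower price; and the boundary conventions of the piecewise cases are choices made here. All names are hypothetical.

```python
# Prices, qualities and costs in integer hundredths, so comparisons are exact.
Q1, Q2 = 100, 90            # q1 = 1.0, q2 = 0.9
C1, C2 = 20, 19             # c(q) = 0.1 (1 + q) gives 0.20 and 0.19
PRICES = range(0, 101)      # assumed discretization interval of 0.01


def u1(p1, p2):
    """Seller 1's profit, equation (1), scaled by 10^4."""
    share = Q1 - p1 if (p1 <= p2 or p1 >= Q2) else Q1 - Q2
    return share * (p1 - C1)


def u2(p1, p2):
    """Seller 2's profit, equation (2), scaled by 10^4."""
    return (Q2 - p2) * (p2 - C2) if p2 < p1 else 0


def sweep_best_p1(p2):
    """Myopic price for seller 1: the first maximiser over the whole grid."""
    return max(PRICES, key=lambda p1: u1(p1, p2))


def sweep_best_p2(p1):
    """Myopic price for seller 2, found the same way."""
    return max(PRICES, key=lambda p2: u2(p1, p2))


def analytic_best_p1(p2):
    """Equation (3) with x1 = q1 + c(q1) = 1.20, in hundredths."""
    x1 = Q1 + C1
    if p2 < x1 - Q2:
        return Q2
    if 2 * p2 <= x1:
        return p2
    return x1 // 2


mismatches = [p2 for p2 in PRICES if sweep_best_p1(p2) != analytic_best_p1(p2)]
print(mismatches, sweep_best_p2(40))
```

For seller 2 the value $\tfrac{1}{2} x_2 = 0.545$ falls between two grid points in this discretization, so the sweep can only return a neighbouring grid price in that regime.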

There are also similar profit landscapes for each seller in the Information- 
Filtering model and in the Shopbot model. In all three models, it is the existence 
of multiple, disconnected peaks in the landscapes, with relative heights that can 
change depending on the other seller’s price, that leads to price wars when the 
sellers behave myopically. 
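
The price-war mechanism can be reproduced in a few lines for the Price-Quality model. This is a sketch, not the simulation used in the chapter: it assumes a price grid in hundredths, breaks ties toward the lower price, lets seller 2 move first, and uses hypothetical function names.

```python
Q1, Q2, C1, C2 = 100, 90, 20, 19     # q1, q2, c(q1), c(q2) in hundredths
PRICES = range(0, 101)               # assumed discretization interval of 0.01


def u1(p1, p2):
    """Seller 1's profit, equation (1), scaled by 10^4."""
    share = Q1 - p1 if (p1 <= p2 or p1 >= Q2) else Q1 - Q2
    return share * (p1 - C1)


def u2(p1, p2):
    """Seller 2's profit, equation (2), scaled by 10^4."""
    return (Q2 - p2) * (p2 - C2) if p2 < p1 else 0


def myopic_price_war(p1=90, rounds=100):
    """Alternate myopic best responses and return seller 1's price after each round."""
    path = []
    for _ in range(rounds):
        p2 = max(PRICES, key=lambda p: u2(p1, p))   # seller 2 replies to p1
        p1 = max(PRICES, key=lambda p: u1(p, p2))   # seller 1 replies to p2
        path.append(p1)
    return path


path = myopic_price_war()
# An upward jump marks the end of one undercutting episode and the start of the next.
resets = sum(1 for prev, cur in zip(path, path[1:]) if cur > prev)
print(min(path), max(path), resets)
```

Seller 1's price drifts downward one step at a time as the sellers undercut each other, then jumps back to the isolated peak at $p_1 = q_2$ once matching seller 2 is no longer worthwhile, and the cycle repeats.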

In these models it is assumed for simplicity that the players have essentially 
perfect information. They can model the consumer behavior perfectly, and they 
also have perfect knowledge of each other’s costs and profit functions. Hence the 
model is in essence a two-player perfect-information deterministic game, similar 
to games like chess. The main differences are that the utilities are not strictly 
zero-sum, there are no terminating or absorbing nodes in the state space, and 
payoffs are given to the players at every time step, rather than at terminating 
nodes. 
Fig. 1. Sample profit landscape for seller 1 in Price-Quality model, as a function of
seller 1 price $p_1$ and seller 2 price $p_2$.



As mentioned previously, the possible seller prices are constrained to lie in a 
range from some minimum to maximum allowable price. The prices are discretized,
so that one can create lookup tables for the seller profit functions $U(p_1, p_2)$.
Furthermore, the optimal pricing policies for each seller as a function of the
other seller’s price, $p_1^*(p_2)$ and $p_2^*(p_1)$, can also be represented in the form of
table lookups.

3 Single-Agent Q-Learning 

Let us first consider ordinary single-agent Q-learning vs. a fixed myoptimal op- 
ponent pricing strategy in the above two-seller economic models. The procedure 
for Q-learning is as follows. Let Q(s,a) represent the discounted long-term ex- 
pected reward to an agent for taking action a in state s. The discounting of 
future rewards is accomplished by a discount parameter $\gamma$ such that the value
of a reward expected at $n$ time steps in the future is discounted by $\gamma^n$. Assume
that the Q(s, a) function is represented by a lookup table containing a value for 
every possible state-action pair, and assume that the table entries are initialized 
to arbitrary values. Then the procedure for solving for Q(s,a) is to infinitely 
repeat the following two-step loop: 

1. Select a particular state s and a particular action a, observe the immediate
reward r for this state-action pair, and observe the resulting state s'.

2. Adjust Q(s,a) according to the following equation:

$$\Delta Q(s, a) = \alpha \left[ r + \gamma \max_b Q(s', b) - Q(s, a) \right] \tag{5}$$
where $\alpha$ is the learning rate parameter, and the max operation represents
choosing the optimal action b among all possible actions that can be taken in the
successor state s' leading to the greatest Q-value. A wide variety of methods may
be used to select state-action pairs in step 1, provided that every state-action
pair is visited infinitely often. For any stationary Markov Decision Problem, the
Q-learning procedure is guaranteed to converge to the correct values, provided
that $\alpha$ is decreased over time with an appropriate schedule.
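
The update in step 2 can be written as a short function. This is a minimal sketch of equation (5) for a lookup table stored as a list of rows; the function name and the toy table are hypothetical.

```python
def q_update(q_table, state, action, reward, next_state, alpha, gamma):
    """Apply equation (5) in place and return the change Delta Q(s, a)."""
    best_next = max(q_table[next_state])            # max over b of Q(s', b)
    delta = alpha * (reward + gamma * best_next - q_table[state][action])
    q_table[state][action] += delta
    return delta


# Toy usage: 2 states x 2 actions, all entries initialised to zero.
q = [[0.0, 0.0], [0.0, 0.0]]
first = q_update(q, state=0, action=1, reward=1.0, next_state=1, alpha=0.1, gamma=0.5)
second = q_update(q, state=1, action=0, reward=0.0, next_state=0, alpha=0.1, gamma=0.5)
print(q, first, second)
```

The second call shows how value propagates backwards: state 1 receives no immediate reward, but its entry still moves because the successor state already has a positive estimate.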

In this pricing application, the distinction between states and actions is so- 
mewhat blurred. It is assumed here that the “state” for each seller is sufficiently 
described by the other seller’s last price, and that the “action” is the current price 
decision. This should be a sufficient state description because no other history is 
needed either for the determination of immediate reward, or for the calculation 
of the myoptimal price by the fixed-strategy player. The definitions of immediate 
reward r and next-state s' have also been modified for the two-agent case. The 
next state s' is defined as the state that is obtained, starting from s, after one
action by the Q-learner and a response action by the fixed-strategy opponent.
Likewise, the immediate reward is defined as the sum of the two rewards ob- 
tained after those two actions. These modifications were introduced so that the 
state s' would have the same player to move as state s. (A possible alternative 
to this, which was not investigated, is to include the side-to-move as additional 
information in the state-space description.) 
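
The modified definitions of r and s' can be made concrete for the Price-Quality model. In the sketch below, the Q-learner is seller 1 and the fixed-strategy opponent is a myopic seller 2; prices are in hundredths on an assumed grid, and all names are hypothetical.

```python
Q1, Q2, C1, C2 = 100, 90, 20, 19       # Price-Quality settings in hundredths
PRICES = range(0, 101)                 # assumed discretization interval of 0.01


def u1(p1, p2):
    """Seller 1's profit, equation (1), scaled by 10^4."""
    share = Q1 - p1 if (p1 <= p2 or p1 >= Q2) else Q1 - Q2
    return share * (p1 - C1)


def myopic_p2(p1):
    """Fixed-strategy opponent: seller 2's myopic reply, found by sweeping."""
    return max(PRICES, key=lambda p2: (Q2 - p2) * (p2 - C2) if p2 < p1 else 0)


def two_step_transition(state, action):
    """Return (r, s') for the Q-learner (seller 1).

    state is seller 2's last price and action is seller 1's new price.
    """
    reply = myopic_p2(action)                       # opponent's response
    reward = u1(action, state) + u1(action, reply)  # sum over both time steps
    return reward, reply                            # in s', seller 1 is to move again


print(two_step_transition(state=40, action=40))
```

Matching the opponent's price earns a large first-step reward, but the second term records the loss of market share once the opponent undercuts, which is exactly the consequence a myopic seller ignores.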

In the simulations reported below, the sequence of state-action pairs selected 
for the Q-table updates was generated by uniform random selection from amongst
all possible table entries. The initial values of the Q-tables were generally
set to the immediate reward values. (Consequently the initial Q-derived policies 
corresponded to myoptimal policies.) The learning rate was varied with time 
according to: 



$$\alpha(t) = \frac{\alpha(0)}{1 + \beta t} \tag{6}$$

where the initial learning rate $\alpha(0)$ was usually set to 0.1, and the constant
$\beta \sim 0.01$ when the simulation time $t$ was measured in units of $N^2$, the size of
the Q-table. ($N$ is the number of possible prices that could be selected by either
player.) A number of different values of the discount parameter $\gamma$ were studied,
ranging from $\gamma = 0$ to $\gamma = 0.9$.
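
The schedule in equation (6) is easy to misapply if the time unit is overlooked, so a small sketch may help. The function name and its arguments are hypothetical; the defaults are the values quoted in the text.

```python
def learning_rate(update_count, num_prices, alpha0=0.1, beta=0.01):
    """Equation (6), with time t measured in units of N^2 table updates."""
    t = update_count / (num_prices ** 2)
    return alpha0 / (1.0 + beta * t)


# With N = 100 prices, one unit of time is 10,000 individual updates.
start = learning_rate(0, 100)
after_100_units = learning_rate(100 * 100 * 100, 100)
print(start, after_100_units)
```

With these settings the learning rate halves only after about 100 units of time, so the decay is slow relative to the size of the table.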

Results for single-agent Q-learning in all three models indicated that Q- 
learning worked well (as expected) in each case. In each model, for each value of 
the discount parameter, exact convergence of the Q-table to a stationary optimal 
solution was found. The convergence times ranged from a few hundred sweeps 
through each table element, for smaller values of $\gamma$, to at most a few thousand
updates for the largest values of $\gamma$. In addition, the profitability of the converged
Q-function’s policy was measured by running the Q-policy against the other 
player’s myopic policy from 100 random starting states, each for 200 time steps, 
and averaging the resulting cumulative profit for each player. It was found that, 
in each case, the seller achieved greater profit against a myopic opponent by 
using a Q-derived policy than by using a myopic policy. (This was true even 
for $\gamma = 0$, because, due to the redefinition of Q updates summing over two
time steps, the case $\gamma = 0$ effectively corresponds to a two-step optimization,
rather than the one-step optimization of the myopic policies.) Furthermore, the
cumulative profit obtained with the Q-derived policy monotonically increased
with increasing $\gamma$ (as expected).




Fig. 2. Results of single-agent Q-learning in the Shopbot model. (a) Average profit
per time step for Q-learner (seller 1, filled circles) and myopic seller (seller 2, open
circles) vs. discount parameter $\gamma$. Dashed line indicates baseline expected profit when
both sellers are myopic. (b) Cross-plot of Q-derived price curve (seller 1) vs. myopic
price curve (seller 2) at $\gamma = 0.5$. Dashed line and arrows indicate a temporal price-pair
trajectory using these policies, starting from filled circle.



It was also interesting to note that in many cases, the expected profit of 
the myopic opponent also increased when playing against the Q-learner, and 
also improved monotonically with increasing $\gamma$. The explanation is that, rather
than better exploiting the myopic opponent, as would be expected in a zero-sum 
game, the Q-learner instead reduced the region over which it would participate 
in a mutually undercutting price war. Typically one finds in these models that 
with myopic vs. myopic play, large-amplitude price wars are generated that start 
at very high prices and persist all the way down to very low prices. When a
Q-learner competes against a myopic opponent, there are still price wars starting at
high prices; however, the Q-learner abandons the price war more quickly as the
prices decrease. The effect is that the price-war regime is smaller and confined
to higher average prices, leading to a closer approximation to cooperative or
collusive behavior, with greater expected utilities for both players.

An illustrative example of the results of single-agent Q-learning is shown in 
figure 2. Figure 2(a) plots the average profit for both sellers in the Shopbot 
model, when one of the sellers is myopic and the other is a Q-learner. (As the 
model is symmetric, it doesn’t matter which seller is the Q-learner.) Figure 2(b) 
plots the myopic price curve of seller 2 against the Q-derived price curve (at
$\gamma = 0.5$) of seller 1. One can see that both curves have a maximum price of 1
and a minimum price of approximately 0.58. The portion of both curves lying 
along the diagonal indicates undercutting behavior, in which case the seller will 
respond to the opponent’s price by undercutting by $\epsilon$, the price discretization
interval. 

Given any initial state $(p_1, p_2)$, the system dynamics in figure 2(b) can be
obtained by alternately applying the two pricing policies. This can be done by
a simple iterative graphical construction, in which one first holds $p_2$ constant
and moves horizontally to the $p_1(p_2)$ curve, and then one holds $p_1$ constant and
moves vertically to the $p_2(p_1)$ curve. We see in this figure that the iterative
graphical construction leads to an unending cyclic price war, whose trajectory
is indicated by the dashed line. Note that the price-war behavior begins at the
price pair (1, 1), and persists until a price of approximately 0.83. At this point,
seller 1 abandons the price war, and resets its price to 1, leading once again to
another round of undercutting.
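
The graphical construction amounts to iterating two lookup tables until a price pair repeats. The sketch below assumes each policy is a list indexed by the other seller's price level; the five-level toy policies are invented for illustration and are not the Shopbot curves of figure 2(b).

```python
def trace_trajectory(policy1, policy2, start, max_rounds=10_000):
    """Alternately apply p1(p2) and p2(p1) until a price pair repeats.

    Returns (path, cycle_start): path lists the visited (p1, p2) pairs and
    cycle_start is the index in path where the repeating cycle begins.
    """
    p1, p2 = start
    path, seen = [], {}
    for _ in range(max_rounds):
        if (p1, p2) in seen:
            return path, seen[(p1, p2)]
        seen[(p1, p2)] = len(path)
        path.append((p1, p2))
        p1 = policy1[p2]      # hold p2 fixed, move to seller 1's curve
        p2 = policy2[p1]      # hold p1 fixed, move to seller 2's curve
    raise RuntimeError("no repeat found within max_rounds")


# Hypothetical policies over price levels 0..4.
undercut = [0, 0, 1, 2, 3]          # seller 2 always undercuts by one level
undercut_or_reset = [4, 4, 1, 2, 3]  # seller 1 resets to the top at low prices
path, cycle_start = trace_trajectory(undercut_or_reset, undercut, start=(4, 4))
print(path, path[cycle_start:])
```

Because the policies are deterministic and the price grid is finite, every trajectory must end in a cycle; a cycle of length one is a fixed point, and a longer cycle is a price war.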

The amplitude of this price war is diminished compared to the situation in 
which both players use a myopic policy. In that case, seller 1’s curve would be a
mirror image of seller 2’s curve, and the price war would persist all the way to
the minimum price point, leading to a lower expected profit for both sellers. 

4 Multi-agent Q-Learning 

Let us now examine the more interesting and challenging case of simultaneous 
training of Q-functions and policies for both sellers. The approach is to use the 
same formalism presented in the previous section, and to alternately adjust a 
random entry in seller 1’s Q-function, followed by a random entry in seller 2’s
Q-function. As each seller’s Q-function evolves, the seller’s pricing policy is cor- 
respondingly updated so that it optimizes the agent’s current Q-function. In 
modeling the two-step payoff r to a seller in equation 5, the opponent’s current 
policy is used, as implied by its current Q-function. The parameters in the expe- 
riments below were generally set to the same values as in the previous section. 
In most of the experiments, the Q-functions were initialized to the immediate- 
reward values (so that the policies corresponded to myopic policies), although 
other initial conditions were explored in a few experiments. 
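
The procedure just described can be sketched as follows for the Price-Quality model. This is an illustrative reading of the text rather than the code used for the reported experiments: the coarse price grid, the number of sweeps, the random seed and all names are assumptions, and no claim is made that this sketch reproduces the reported results.

```python
import random

PRICES = list(range(0, 101, 5))      # hundredths; a coarse grid chosen for speed
Q1, Q2, C1, C2 = 100, 90, 20, 19     # q1, q2, c(q1), c(q2) in hundredths
N = len(PRICES)


def u1(own, rival):
    """Seller 1's profit, equation (1)."""
    share = Q1 - own if (own <= rival or own >= Q2) else Q1 - Q2
    return share * (own - C1) / 1e4


def u2(own, rival):
    """Seller 2's profit, equation (2)."""
    return (Q2 - own) * (own - C2) / 1e4 if own < rival else 0.0


def greedy(row):
    """Index of the highest-valued action in one row of a Q-table."""
    return max(range(N), key=row.__getitem__)


def simultaneous_q_learning(gamma=0.5, sweeps=50, alpha0=0.1, beta=0.01, seed=0):
    rng = random.Random(seed)
    rewards = (u1, u2)
    # q[k][s][a]: seller k prices at PRICES[a] when the rival's last price is PRICES[s].
    # Tables start at the immediate rewards, so the initial policies are myopic.
    q = [[[rewards[k](PRICES[a], PRICES[s]) for a in range(N)] for s in range(N)]
         for k in range(2)]
    policy = [[greedy(q[k][s]) for s in range(N)] for k in range(2)]
    for step in range(sweeps * N * N):
        alpha = alpha0 / (1.0 + beta * step / (N * N))      # equation (6)
        for k in range(2):                                  # alternate between sellers
            s, a = rng.randrange(N), rng.randrange(N)
            reply = policy[1 - k][a]                        # rival's current response
            r = rewards[k](PRICES[a], PRICES[s]) + rewards[k](PRICES[a], PRICES[reply])
            target = r + gamma * max(q[k][reply])
            q[k][s][a] += alpha * (target - q[k][s][a])     # equation (5)
            policy[k][s] = greedy(q[k][s])                  # keep policy greedy in Q
    return q, policy


q_tables, policies = simultaneous_q_learning(sweeps=5)
print([PRICES[a] for a in policies[0]])
```

The only difference from the single-agent case is the line that looks up the rival's response: it reads the rival's current Q-derived policy instead of a fixed myopic rule, so each seller's learning target moves as the other seller learns.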

For simultaneous Q-learning in the Price-Quality model, robust convergence
to a unique pair of pricing policies is obtained, independent of the value of $\gamma$, as
illustrated in figure 3(b). This solution also corresponds to the solution found
by generalized minimax and by generalized DP in Tesauro and Kephart (2000).
Note that repeated application of this pair of price curves leads to a dynamical
trajectory that eventually converges to a fixed point located at ($p_1 = 0.9$, $p_2 = 0.4$).
A detailed analysis of these pricing policies and the fixed-point solution is
presented in Tesauro and Kephart (2000). In brief, for sufficiently low prices of
seller 2, it pays seller 1 to abandon the price war and to charge a very high price,
$p_1 = 0.9$. The value of $p_2 = 0.4$ then corresponds to the highest price that seller
2 can charge without provoking an undercut by seller 1, based on a two-step
lookahead calculation (seller 1 undercuts, and then seller 2 replies with a further
undercut). Note that this fixed point does not correspond to a (one-shot) Nash
equilibrium, since both players have an incentive to deviate, based on a one-step
lookahead calculation.

Fig. 3. Results of simultaneous Q-learning in the Price-Quality model. (a) Average
profit per time step for seller 1 (solid diamonds) and seller 2 (open diamonds) vs.
discount parameter $\gamma$. Dashed line indicates baseline myopic vs. myopic expected profit.
Note that seller 2’s profit is higher than seller 1’s, even though seller 2 has a lower
quality parameter. (b) Cross-plot of Q-derived price curves (at any $\gamma$). Dashed line
and arrows indicate a sample price dynamics trajectory, starting from the filled circle.
The price war is eliminated and the dynamics evolves to a fixed point indicated by an
open circle.
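
The statement that the fixed point ($p_1 = 0.9$, $p_2 = 0.4$) is not a one-shot Nash equilibrium can be verified from the profit functions of equations (1) and (2) with the figure 1 parameter settings. The sketch below uses hypothetical names, and the two deviations shown are simply one profitable example for each seller.

```python
Q1, Q2 = 1.0, 0.9


def cost(q):
    """Production cost c(q) = 0.1 (1 + q)."""
    return 0.1 * (1.0 + q)


def u1(p1, p2):
    """Seller 1's profit, equation (1)."""
    share = Q1 - p1 if (p1 <= p2 or p1 >= Q2) else Q1 - Q2
    return share * (p1 - cost(Q1))


def u2(p1, p2):
    """Seller 2's profit, equation (2)."""
    return (Q2 - p2) * (p2 - cost(Q2)) if p2 < p1 else 0.0


p1_fixed, p2_fixed = 0.9, 0.4
stay1, stay2 = u1(p1_fixed, p2_fixed), u2(p1_fixed, p2_fixed)

# One-step deviations, holding the other seller's price fixed.
deviate1 = u1(p2_fixed, p2_fixed)     # seller 1 matches seller 2's price
deviate2 = u2(p1_fixed, 0.545)        # seller 2 moves to x2 / 2 = 0.545
print(stay1, deviate1, stay2, deviate2)
```

Each seller gains in the very next step by deviating, so the fixed point is sustained only by the anticipated reply of the other seller, which is the two-step reasoning that the Q-functions encode.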

The cumulative profits obtained by the pair of pricing policies are plotted in 
figure 3(a). It is interesting that seller 2, the lower-quality seller, actually obtains 
a significantly higher profit than seller 1, the higher-quality seller. In contrast, 
with myopic vs. myopic pricing, seller 2 does worse than seller 1. 

In the Shopbot model, exact convergence of the Q-functions was not found
for all values of $\gamma$. However, in those cases where exact convergence was not
found, there is very good approximate convergence, in which the Q-functions
and policies converged to stationary solutions to within small random fluctuations.
Different solutions were obtained at each value of $\gamma$. Symmetric solutions,
in which the shapes of $p_1(p_2)$ and $p_2(p_1)$ are identical, are generally
obtained at small $\gamma$, whereas a broken symmetry solution, similar to the
Price-Quality solution, is obtained at large $\gamma$. There is also a range of $\gamma$ values, between
0.1 and 0.2, where either a symmetric or asymmetric solution could be obtained,
depending on initial conditions. The asymmetric solution is counter-intuitive,
because one would expect that the symmetry of the two sellers’ profit functions
would lead to a symmetric solution. In hindsight, one can apply the same type
of reasoning as in the Price-Quality model to explain the asymmetric solution. A
plot of the expected profit for both sellers as a function of $\gamma$ is shown in figure 4.
Plots of the symmetric and asymmetric solution, obtained at $\gamma = 0$ and $\gamma = 0.9$
respectively, are shown in figure 5.

Fig. 4. Average profit per time step obtained by simultaneous Q-learners seller 1
(solid diamonds) and seller 2 (open diamonds) in the Shopbot model, as a function
of discount parameter $\gamma$. Dashed line indicates baseline myopic vs. myopic expected
profit.

Fig. 5. Cross-plots of Q-derived pricing policies obtained by simultaneous Q-learning
in the Shopbot model. (a) Q-derived price curves at $\gamma = 0$; the solution is symmetric.
Dashed line and arrows indicate a sample price dynamics trajectory. (b) Q-derived
price curves at $\gamma = 0.9$; the solution is asymmetric.




Fig. 6. Results of multi-agent Q-learning in the Information-Filtering model. (a) Average
profit per time step for seller 1 (solid diamonds) and seller 2 (open diamonds)
vs. discount parameter γ. (The data points at γ = 0.6, 0.7 represent unconverged
Q-functions and policies.) Dashed line indicates baseline expected profit when both
sellers are myopic. (b) Cross-plot of Q-derived price curves at γ = 0.5.



Finally, in the Information-Filtering model, simultaneous Q-learning produced
exact or good approximate convergence only for small values of γ (0 <
γ < 0.5). For large values of γ, no convergence was obtained. The simultaneous
Q-learning solutions yielded reduced-amplitude price wars, and monotonically
increasing profitability for both sellers as a function of γ, at least up to γ = 0.5. A
few data points were examined at γ > 0.5, and even though there was no
convergence, the Q-policies still yielded greater profit for both sellers than in the
myopic vs. myopic case. A plot of the Q-derived policies and system dynamics
for γ = 0.5 is shown in figure 6(b). The expected profits for both players as a
function of γ are plotted in figure 6(a).

5 Q-Learning with Neural Networks 

Using lookup tables to represent Q-functions, as described in the previous two
sections, is feasible only for small-scale problems. It is likely that the situations
faced by software agents in the real world will be too large-scale and complex
to tackle via lookup tables, and that some sort of function approximation
scheme will be necessary. This section examines the use of multi-layer neural
networks to represent the Q-functions in the same economic models studied
previously. Some initial results are presented for the case of a single adaptive QNN
(Q-learning neural network) agent, training vs. a fixed-strategy myopic agent,
as was described previously in section 3.

The neural networks studied here are multi-layer perceptrons (MLPs) as
used in back-propagation. The same Q-learning equation 5 is used as previously;
however, the quantity ΔQ(s, a) is interpreted as the output error signal used
in a backprop-style gradient calculation of weight changes. As in the previous
sections, the state-action pairs (s, a) are chosen by uniform random exploration,
although there is some preliminary evidence that somewhat better policies can be
obtained by training on actual trajectories. Also, in the experiments below, a fixed
learning-rate constant α = 0.1 was used, rather than the time-varying schedule
α(t) described previously. This appears to give a significant speed increase at
the cost of only a slight degradation in final network performance.

One of the most important issues in using neural networks is the design
of the input state representation scheme. Schemes that incorporate specialized
knowledge of the domain can often do better than naive representation schemes.
The only knowledge included here is that it is important for a seller to know
whether its price is greater than, less than, or equal to the other seller’s price.
This suggests a coding scheme using five input units. The first two units represent
the two seller prices (p1, p2) as real numbers, and the remaining three units are
binary units representing the three logical conditions [p1 < p2], [p1 = p2], and
[p1 > p2].
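
The five-unit coding scheme can be illustrated with a minimal sketch. The function name `encode_state` is hypothetical, and the exact-equality test assumes that prices lie on a shared discrete grid, so that equal prices compare equal as floating-point numbers.

```python
def encode_state(p1, p2):
    """Return the five network inputs for a price pair (p1, p2).

    Units 1-2 carry the two prices as real numbers; units 3-5 are
    binary flags for p1 < p2, p1 == p2 and p1 > p2 respectively.
    Assumes both prices come from the same discrete price grid.
    """
    return [
        float(p1),
        float(p2),
        1.0 if p1 < p2 else 0.0,
        1.0 if p1 == p2 else 0.0,
        1.0 if p1 > p2 else 0.0,
    ]
```

Exactly one of the three binary units is active for any price pair, which gives the network a direct handle on the undercutting discontinuity.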

In the experiments below, the networks contained a single linear output unit
and a single hidden layer of 10 hidden units, fully connected to the input layer.
For the Shopbot model, the network appeared to reach peak profitability within
a few hundred sweeps. However, its ability to approximate the correct Q-function
after this amount of training was generally poor, with the worst accuracy
obtained at large values of γ. Furthermore, the approximation error improved
extremely slowly with training, continuing to decrease at an extremely
slow rate for as long as the training was continued. Typical training runs lasted
for several tens of thousands of sweeps through all possible price pairs. In the
case of γ = 0, an extremely long training run of several million sweeps was
performed, after which the approximation accuracy was quite good, but the error
was still decreasing at a very slow rate.
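
A minimal sketch of such a network and its training step is given below, in plain Python. Several details are assumptions made for illustration rather than taken from the experiments: the hidden units are taken to be tanh units, the error signal is taken to be the standard one-step Q-learning error r + γ max Q(s', a') − Q(s, a), and the names `init_params`, `forward` and `td_update` are hypothetical.

```python
import math
import random


def init_params(n_in=5, n_hidden=10, seed=0):
    """Small random weights for a 5-10-1 network (sizes as in the text)."""
    rng = random.Random(seed)
    return {
        "W1": [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_hidden)],
        "b1": [0.0] * n_hidden,
        "w2": [rng.gauss(0.0, 0.1) for _ in range(n_hidden)],
        "b2": 0.0,
    }


def forward(params, x):
    """Return (Q estimate, hidden activations) for input vector x."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(params["W1"], params["b1"])]
    q = sum(w * hj for w, hj in zip(params["w2"], h)) + params["b2"]
    return q, h


def td_update(params, x, reward, next_xs, gamma, alpha=0.1):
    """One backprop step using the Q-learning error as the output error.

    x       : input encoding of the current state-action pair
    next_xs : encodings of every action available in the next state
    Returns the error before the weight change.
    """
    q, h = forward(params, x)
    q_next = max(forward(params, nx)[0] for nx in next_xs)
    delta = reward + gamma * q_next - q  # assumed form of the error signal
    for j, hj in enumerate(h):
        # Error back-propagated to hidden unit j, using the old output weight.
        grad_hj = delta * params["w2"][j] * (1.0 - hj * hj)
        params["w2"][j] += alpha * delta * hj
        params["b1"][j] += alpha * grad_hj
        for i, xi in enumerate(x):
            params["W1"][j][i] += alpha * grad_hj * xi
    params["b2"] += alpha * delta
    return delta
```

One sweep in the sense used above would call `td_update` once for every price pair; the fixed `alpha=0.1` mirrors the constant learning rate mentioned earlier.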

The difficulty in obtaining accurate function approximation could have
resulted from inaccurate targets ΔQ(s, a) used in training, or it could be due to
intrinsic limitations in the function approximator itself. In a separate series of
experiments, neural nets were trained with exact targets provided by the lookup
tables, and this yielded no measurable advantage in approximation accuracy,
suggesting that the problem is not due to inaccurate heuristic teacher signals in
equation 5.

During the neural net training runs, the expected performance of the neural
net policy was periodically measured vs. the myopic opponent. While the
absolute error in the Q-function improved monotonically, the policy’s expected profit
was found to reach a peak relatively quickly (usually within 100-200 sweeps,
but longer for large γ) and then either level off or decrease slightly. A plot of the
peak expected profit for each training run as a function of γ is shown in figure 8.
For comparison, the expected profit of the exact optimal policy, obtained by
lookup table Q-learning, is also plotted. It is encouraging to note that, although the
absolute accuracy of the neural net Q-function is poor and improves extremely
slowly, the resulting policies nevertheless give reasonably decent performance
and can be trained relatively quickly. This re-emphasizes a point
found in other successful applications of neural nets and reinforcement learning:
the neural net approach can often give a surprisingly strong policy, even though
the absolute accuracy of the value function is poor.




Fig. 7. Plot of the Q-function for single-agent Q-learning vs. a myopic opponent in
the Shopbot model at γ = 0.7. p_Q is the Q-learner’s price and p_MY is the myopic
player’s price. (a) Exact optimal Q-function obtained by lookup table Q-learning. (b)
Neural network approximation to the Q-function.



An illustration of the quality of neural net function approximation compared
to an exact lookup table solution is shown in figure 7. (This network was
trained for 50,000 sweeps through all possible price pairs, and displays significantly
better function approximation than at the point where peak profitability is first
reached.) We can see that the neural net approximates the flat and smoothly
curving portions of the landscape quite well. It also correctly fits the diagonal
discontinuity along the undercutting line p_Q < p_MY, primarily due to knowledge
engineering of an additional input feature to represent this discontinuity.
However, the neural net fits poorly the second discontinuity in the low p_Q regime; this
results in a suboptimal policy.

Results of simultaneous Q-learning in the Shopbot model are plotted in figure
9. These runs appear to reach peak performance after a few thousand sweeps.
With further training, the approximation error continues to decrease slowly, but
no improvement in policy strength is observable. Note that the average profit
in figure 9 compares quite favorably with figure 4. The Q-derived policies for
γ = 0.9 in figure 9 have roughly the same qualitative shape as the curves in
figure 5 based on lookup tables. Preliminary results for the Information-Filtering
model also demonstrate the same qualitative relationship between neural net policies
and lookup table policies. (These networks only had 5 hidden units; it seems
likely that significant improvements could be obtained by increasing the number
of hidden units.)

Fig. 8. Plot of expected profit for a single Q-learning agent, training against a fixed
myopic opponent, in the Shopbot model, as a function of γ. Filled circles represent a
neural network Q-learner, while open circles represent a lookup table Q-learner. Each
data point represents the best policy obtained during a training run. By comparison,
the baseline myopic vs. myopic expected profit is indicated by the dashed line.

Fig. 9. Results of simultaneous neural network Q-learning in the Shopbot model. (a)
Average profit per time step for seller 1 (solid diamonds) and seller 2 (open diamonds)
vs. discount parameter γ. Dashed line indicates baseline myopic vs. myopic expected
profit. (b) Cross-plot of Q-derived price curves at γ = 0.9.



6 Summary 

This chapter has examined single-agent and multi-agent Q-learning in three 
models of a two-seller economy in which the sellers alternately take turns setting 
prices, and then instantaneous utilities are given to both sellers based on the 
current price pair. Such models fall into the category of two-player, alternating- 
turn, arbitrary-sum Markov games, in which both the rewards and the state- 
space transitions are deterministic. The game is Markov because the state space 
is fully observable and the rewards are not history dependent. 

In all three models (Price-Quality, Information-Filtering, and Shopbot),
large-amplitude cyclic price wars are obtained when the sellers myopically optimize
their instantaneous utilities without regard to the longer-term impact of their
pricing policies. It is found that, in all three models, the use of Q-learning by one of
the sellers against a myopic opponent invariably results in exact convergence to
the optimal Q-function and optimal policy against that opponent, for all allowed
values of the discount parameter γ. The use of the Q-derived policy yields
greater expected profit for the Q-learner, with monotonically increasing profit as γ
increases. In many cases, it has a side benefit of enhancing social welfare by also
giving greater expected profit for the myopic opponent. This comes about by
reducing the amplitude of the undercutting price-war regime, or in some cases,
eliminating it completely.

The more interesting and challenging situation of simultaneously training
Q-functions for both sellers has also been studied. This is more difficult because
as each seller’s Q-function and policy change, it provides a non-stationary
environment for adaptation of the other seller. No convergence proofs exist for such
simultaneous Q-learning by multiple agents. Nevertheless, despite the absence of
theoretical guarantees, generally good behavior of the algorithm was found. In
two of the models (Shopbot and Price-Quality), exact or very good approximate
convergence was obtained to simultaneously self-consistent Q-functions and
optimal policies for any value of γ, whereas in the Information-Filtering model,
simultaneous convergence was found for γ < 0.5. In the Information-Filtering
and Shopbot models, monotonically increasing expected utilities for both sellers
were also found for small values of γ. In the Price-Quality model, simultaneous
Q-learning yields an asymmetric solution, corresponding to the solution found
in Tesauro and Kephart (2000), that is highly advantageous to the lower-quality
seller, but slightly disadvantageous to the higher-quality seller, when compared
to myopic vs. myopic pricing. A similar asymmetric solution is also found in the
Shopbot model for large γ, even though the profit functions for both players are
symmetric.

For each model, there exists a range of discount parameter values where 
the solutions obtained by simultaneous Q-learning are self-consistently optimal, 
and outperform the solutions obtained in Tesauro and Kephart (2000). This is 
presumably because the previously published methods were based on limited 
lookahead, whereas the Q-functions in principle look ahead infinitely far, with 
appropriate discounting. 

It is intriguing that simultaneous Q-learning works well in these models,
despite the lack of theoretical convergence proofs. Sandholm and Crites also found
that simultaneous Q-learning generally converged in the Iterated Prisoner’s
Dilemma game. These empirical findings suggest that a deeper theoretical analysis
of simultaneous Q-learning may be worth investigating. There may be some
underlying theoretical principles that can explain why simultaneous Q-learning
works, for at least certain classes of arbitrary-sum utility functions.

Some initial steps have also been taken in combining nonlinear function ap- 
proximation, using neural nets, with the Q-learning approach. It was found that 
a single neural net Q-learner facing a myopic opponent can exhibit reasonably 
good pricing policies, despite difficulties in obtaining an accurate approximation 
to the Q-function. Two neural networks performing simultaneous Q-learning 
were also able to obtain pricing policies that match the qualitative characteri- 
stics of lookup table policies, and with profitabilities close to the lookup table 
results. 

In addition to replacing lookup tables with function approximators, several 
other important challenges will also be faced in extending this approach to larger- 
scale, more realistic simulations. First, the three economic models used here quite 
deliberately ignored frictional effects such as agent search costs. Such effects can 
damp out price wars, and can lead to different system behaviors such as partial 
equilibria that support stable price differentiation. Eventually such frictional 
effects will have to be considered, although it has been argued in prior studies of 
these models (Sairamesh and Kephart, 1998; Kephart, Hanson and Sairamesh, 
1998; Greenwald and Kephart, 1999) that frictional effects in Web-based agent 
economies will be considerably smaller than in traditional human economies. 
Also, with many sellers, the concept of sellers taking turns adjusting their prices 
in a well-defined order becomes problematic. This could lead to an additional 
combinatorial explosion, if the mechanism for calculating expected reward has 
to anticipate all possible orderings of opponent responses. 

Furthermore, while these economic models have a moderate degree of realism
in their profit functions, they are unrealistic in the assumptions of knowledge
and dynamics. In the work reported here, the state space was fully observable
infinitely frequently at zero cost and with zero propagation delays. The
expected consumer response to a given price pair was instantaneous, deterministic
and fully known to both players. Indeed, the players’ exact profit functions were
fully known to both players. It was also assumed that the players would
alternately take turns equally often in a well-defined order in adjusting their prices.
Under such assumptions of knowledge and dynamics, one could hope to develop
an algorithm that could calculate in advance something like a game-theoretic
optimal pricing algorithm for each agent.

However, in realistic agent economies, it is likely that agents will have much 
less than full knowledge of the state of the economy. Agents may not know the 
details of other agents’ profit functions, and indeed an agent may not know its 
own profit function, to the extent that buyer behavior is unpredictable. The 
dynamics of buyers and sellers may also be more complex, random and unpre-
dictable than what is assumed here. There may also be information delays for both
buyers and sellers, and part of the economic game may involve paying a cost in 
order to obtain information about the state of the economy faster and more fre- 
quently, and in greater detail. Finally, one may expect that buyer behavior will 
be non-stationary, so that there will be a more complex co-evolution of buyer 
and seller strategies. 

While such real-world complexities are daunting, there are reasons to believe 
that learning approaches such as Q-learning may play a role in practical solu- 
tions. The advantage of Q-learning is that one does not need a model of either 
the instantaneous payoffs or of the state-space transitions in the environment. 
One can simply observe actual rewards and transitions and base learning on 
that. While the theory of Q-learning requires exhaustive exploration of the state 
space to guarantee convergence, this may not be necessary when function ap- 
proximators are used. In that case, after training a function approximator on a 
relatively small number of observed states, it may then generalize well enough on 
the unobserved states to give decent practical performance. Several recent em- 
pirical studies have provided evidence of this (Tesauro, 1995; Crites and Barto, 
1996; Zhang and Dietterich, 1996). 
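
The model-free character of Q-learning can be made concrete with a minimal tabular sketch. The update below is the standard one-step Q-learning rule, driven only by an observed transition (s, a, r, s'); the function name `q_update` and the dictionary-based table are illustrative assumptions, not the implementation used in this chapter.

```python
from collections import defaultdict


def q_update(Q, s, a, r, s_next, actions, gamma, alpha):
    """Update Q[(s, a)] from one observed transition, with no model of
    the payoffs or of the state transitions.

    Q       : mapping from (state, action) to value, default 0.0
    actions : actions available in s_next
    """
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]


# Usage: a table that starts at zero and is updated from experience.
Q = defaultdict(float)
q_update(Q, s=0, a=1, r=1.0, s_next=0, actions=[0, 1], gamma=0.5, alpha=0.1)
```

Only the observed reward `r` and successor state `s_next` enter the update, which is why no explicit profit function or consumer model is needed.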

Acknowledgements. The author thanks Jeff Kephart and Amy Greenwald for
helpful discussions.

1 Introduction

Serial order is an important aspect of human and animal behavior and
continues to receive attention. In spite of research over many years, it is not yet
completely understood how the brain solves the problem of serial order in motor
behavior (Rosenbaum 10). However, in recent years there has been rapid progress
in the domain of sequence processing, both in the neuroscience literature
(electrophysiology on animals and imaging studies on humans) and in the computational
learning literature.

In the sequence learning literature, various researchers have used neural networks
(for example, Bapi & Doya 1, Dominey 2, Elman 3, etc.) to learn multiple
sequences. Problems with such approaches are long training times and catastrophic
forgetting. Recently, Wolpert and Kawato (16) proposed a multiple paired
forward-inverse model architecture for human motor learning and successfully applied
it to the problem of manipulation of multiple objects (Haruno et al. 4). They
argued that such paired models enable learning and retrieval of the appropriate
models based on environmental context and facilitate smooth interpolation for
generalization. These models also avoid the catastrophic forgetting problem, since
simulations demonstrate that the model switching is robust (Haruno et al. 4).
In the current work we extend this approach to sequence learning. We propose
a multiple forward model (MFM) architecture for sequence processing. The key
idea is that sequence learning and switching are based on prediction errors. After
introducing the details of the proposed architecture, we will review the
experimental findings of Tanji and Shima (15) on monkeys. The aim of the current work
is to take these experimental results as an example case and test the proposed
MFM architecture.

2 MFM Architecture 

A forward model is defined as one that predicts the response output y(t + 1),
given the inputs x(t) to the system. In the sequence processing context, a forward
model predicts the next action y(t + 1), given the previous response y(t). Figure
1 shows the MFM architecture.

Fig. 1. Neural Network Architecture



In the following we describe the processing steps in order. Firstly, the network
is presented with the previous response y(t). Each of the Sequence Modules
predicts the next action y^k(t + 1) based on the previous response. These
predictions are gated by the responsibility factors in the Responsibility Gating
module to generate a predicted response. The Response Selection module
combines the predicted response with the input to generate the actual action
output y(t + 1). Errors resulting from the comparison of the desired target to
the predicted response of each module are sent to the Prediction Error module.
The Responsibility Estimation module transforms the prediction errors into
responsibility factors λ^k. The λ^k signify the relative appropriateness of the modules in
the current context and drive Hebbian learning in the sequence modules. The
above processing steps are repeated until the sequence switching becomes stable.
We now present the detailed equations.

2.1 Sequence Module Output 

Each sequence module k predicts the next action y^k(t + 1) based on the previous
response y(t).

2.2 Response Selection

The predicted response is used in two ways. Firstly, it is used to generate the
actual action output y_j(t + 1). The predicted response from each sequence module,
y^k(t + 1), is gated by the corresponding responsibility factor λ^k(t), and the final
output is chosen by g(·), either the maximum function or a suitable stochastic
function.

y_j(t + 1) = g\left( \sum_k \lambda^k(t)\, y_j^k(t + 1) \right) \qquad (2)



2.3 Responsibility Estimation 

The second use of the predicted responses is to calculate the responsibility fac- 
tors. 



Prediction Error. The prediction error E^k(t) of each module is calculated
based on the predicted response y_j^k(t) and the desired target y_j^*(t).

\mathrm{error}_j^k(t) = y_j^*(t) - y_j^k(t) \qquad (3)

E^k(t) = \sum_{j=1}^{3} \left( \mathrm{error}_j^k(t) \right)^2 \qquad (4)



Responsibility. We assume that the probability p^k(t) of module k being
the appropriate predictor is proportional to the exponential of the error E^k(t)
with a scaling factor σ(t).

p^k(t) = \exp\left( -E^k(t) / \sigma(t)^2 \right) \qquad (5)

The responsibility estimation module keeps a memory of the previous p^k(t − δ)
values, where δ = 0, ..., Δ. The responsibility factor λ^k(t) is then calculated by
normalizing the product of all the probabilities.

\lambda^k(t) = \frac{\prod_{\delta=0}^{\Delta} p^k(t-\delta)}{\sum_{l} \prod_{\delta=0}^{\Delta} p^l(t-\delta)} \qquad (6)
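
The responsibility computation described in this subsection can be sketched in a few lines of Python. This is a sketch under stated assumptions, not the authors' code: the probability is taken to be exp(−E/σ²), which is one common way to realize "proportional to the exponential of the error with a scaling factor", and the names `module_probability` and `responsibilities` are hypothetical.

```python
import math


def module_probability(error, sigma):
    """Probability-like score for one module: lower error, higher score.

    The form exp(-E / sigma**2) is an assumption made for illustration.
    """
    return math.exp(-error / (sigma ** 2))


def responsibilities(prob_history):
    """Normalize the product of remembered probabilities for each module.

    prob_history[k] holds the values p^k(t - delta) for delta = 0..Delta,
    i.e. the current probability of module k and its remembered past values.
    Returns one responsibility factor per module; the factors sum to 1.
    """
    products = [math.prod(history) for history in prob_history]
    total = sum(products)
    return [p / total for p in products]


# Usage: two modules, each with a memory of two probability values.
lam = responsibilities([[0.5, 0.5], [0.25, 0.25]])
```

Because the product runs over several past time steps, a single poor prediction by an otherwise reliable module changes its responsibility only moderately, which is what damps rapid switching.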



The combination of using the exponential function and normalization
amounts to using a softmax operation in the responsibility estimation. The use of
past probability values ensures that modules are not switched around wildly.

2.4 Learning in the Sequence Modules

Hebbian learning in the sequence modules is gated by the responsibility factor. 
Thus at any time, the module that predicts the best learns the most. Although in 
the current formulation the forward models work as one-step-ahead predictors, 
they can be designed as more general predictors with longer sequential context. 



w_{ji}^k(t + 1) = w_{ji}^k(t) + \eta\, \lambda^k\, \mathrm{error}_j^k(t)\, y_i(t) \qquad (7)
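
The responsibility-gated learning described in this subsection can be sketched as follows. The sketch assumes a single-layer linear module whose weight from input unit i to output unit j changes in proportion to the learning rate, the module's responsibility, the output error and the input activity; the input term and the names `gated_update` and `eta` are assumptions made for illustration.

```python
def gated_update(weights, responsibility, errors, inputs, eta=0.1):
    """Update one module's weights in place, scaled by its responsibility.

    weights[j][i]  : weight from input unit i to output unit j
    responsibility : responsibility factor of this module (0..1)
    errors[j]      : target minus prediction for output unit j
    inputs[i]      : previous response y_i(t) fed to the module
    """
    for j, err in enumerate(errors):
        for i, y in enumerate(inputs):
            weights[j][i] += eta * responsibility * err * y
    return weights


# Usage: a module with responsibility 0.5 learns at half the full rate.
w = [[0.0, 0.0], [0.0, 0.0]]
gated_update(w, 0.5, errors=[1.0, -1.0], inputs=[1.0, 0.0], eta=0.1)
```

A module with responsibility close to zero is left almost unchanged, so the module that currently predicts best absorbs most of the learning.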

In the next section we present the details of the experiments of Tanji and 
Shima on monkeys. The specific equations used to reproduce the results of Tanji 
and Shima are given in the Appendix. The subsequent section presents the si- 
mulation results. 



3 Studies on Monkeys 

Tanji and colleagues devised a sequence learning paradigm and made single cell
electrode recordings in behaving monkeys while they performed sequential arm
movements. The recordings were made in various areas in the anterior
part of the neocortex, including the supplementary motor area (SMA), the
pre-supplementary motor area (preSMA), the premotor cortex (PM), and the motor
cortex (MI). All these areas are hypothesized to participate in the learning
and/or execution of external-input-driven and/or internally-generated
sequential motor acts (Tanji 14). After describing the experimental setup, we
summarize the results from various experiments.



3.1 Experimental Details 

Monkeys were initially trained to respond with a push, pull or turn movement
on a handle when presented with a red, yellow, or green color light, respectively,
on a display panel. After establishing the color-to-action mapping, they were
required to sequence these elementary actions by observing a succession of color
lights and recalling the appropriate movement at every step. In this manuscript
we will refer to the four sequences as I:123, II:132, III:321, and IV:312. The
trials in which color lights preceded movements were called ‘Visual Trials (VT)’
and the other trials, in which there were no colored lights, were called ‘Memory
Trials (MT)’. In a block of trials, monkeys experienced 5 VT followed by 6 MT.
They were taught four such sequences until they achieved a 95% success rate on
each sequence. After this stage, single cell recordings were made while monkeys
performed several blocks of trials. Each block used a different sequence and the
order of appearance of different sequences varied semi-randomly. The end of a
block was signaled by random flashing of lights, which also indicated a change in
the sequence.

3.2 Experimental Results

In the present study, we focus on the results of single unit recordings in SMA and
preSMA. SMA recordings (Tanji & Shima 15) revealed significant activation
exhibiting two main properties: i) sequence-specific activity: some cells were
active specifically before the performance of a particular sequence (say,
Turn-Pull-Push) and ii) transition-specific activity: some cells were tonically active
after performing a specific movement and before initiating a particular next
movement (say, Push followed by Pull). In contrast, the activity of the majority
of cells in MI was selective to single movements but not sequence-specific.
While SMA cells were mainly active in the MT phase, cells in PM were active
only during the VT phase and cells in MI were active during all phases of the
task. Some cells in preSMA were active only during the very first VT, possibly
signifying the start of a new sequence and thus specifying the need for a change
in motor plan (Shima et al. 11). In this article, learning of sequence and transition
specificity is demonstrated using the MFM architecture.

4 Simulation Results 

Detailed equations used in the simulations are shown in the Appendix. Each 
sequence is presented to the network successively for 5 blocks. All the four se- 
quences are shown to the network, the order of presentation being randomized. 
Each block consists of 5 VT followed by 6 MT as in the monkey experiments. 
During the VT, the network has access to the desired target and hence the lear- 
ning progresses in a supervised fashion. In the current simulations, during the 
VT period we assume that the appropriate (colored) light-to-movement mapping 
has already been established and thus the network performs at 100% accuracy. 
However, during the MT the desired output is not available and is calculated 
based on the reward. For example, if the predicted response is incorrect, then the 
desired target is set as the opposite of the actual response (for example, 110 for 
001). Thus the learning during the MT progresses in a semi-supervised fashion. 
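
The way a desired target is derived during memory trials can be illustrated with a minimal sketch. It assumes that responses are binary vectors over the three movements; the text states only the incorrect case (the target is the complement of the actual response), so keeping the actual response as the target after a correct trial is an assumption, and the function name `memory_trial_target` is hypothetical.

```python
def memory_trial_target(actual, correct):
    """Derive a training target from the reward signal alone.

    actual  : the network's binary response, e.g. [0, 0, 1]
    correct : True if the response was rewarded
    If the response was wrong, the target is its bitwise complement;
    if it was right, the response itself is reinforced (assumed).
    """
    if correct:
        return list(actual)
    return [1 - bit for bit in actual]
```

For the example in the text, an incorrect response of 001 yields the target 110, which pushes the network away from the wrong movement without revealing which of the other two is right.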



4.1 Long Term Behavior of the Network 

Simulation results for 132 blocks are shown in Figure 2. The top row shows the
randomized presentation of all the four sequences during the simulations. The
second and third rows show how the responsibility factor λ varies as sequences are
learnt during the last visual and memory trials, denoted as λ-VT5 and λ-MT6
respectively. The bottom row shows the number of steps performed correctly
(performance accuracy) in the last memory trial (MT6) as simulations
progressed. It can be observed that as the switching became stable, the performance
accuracy also improved. For example, although Module 3 initially captured both
Sequence I (123) and Sequence II (132), specialization for Sequence II (132)
developed by the 20th block and stayed stable from then on. More details are shown in
later figures.

Fig. 2. Results for all the blocks are shown; the horizontal axis is the block number
(0 to 132). Legend: Modules - M1: Module 1; M2: Module 2; M3: Module 3;
M4: Module 4; Sequence IDs - I: 123, II: 132, III: 321, IV: 312



For clarity, various snapshots during the intermediate blocks are shown in
Figures 3, 4 and 5. Figure 3 shows that the performance in the early blocks is
erratic and that Module 3 is active during the presentation of Sequences IV, I,
and II in the 6th memory trial (see the graph on the third row, λ-MT6).

By the 30th block in Figure 4, we see that Sequences II and III are represented
by Modules 3 and 2, respectively, and that for Sequences IV and I switching is not
yet stable. However, by the end of the simulation the switching became stable
and specialized (SeqI:M1, SeqII:M3, SeqIII:M2, SeqIV:M4) as shown in Figure
5.

4.2 Short Term Behavior of the Network 

Short term behavior of the network during single blocks (Blocks 38 and 132) is
shown in Figures 6 and 7. Sequence I is presented to the network during both
these blocks. During the early stages the network did not arrive at a stable
switching solution for Sequence I. As shown in Figure 6, the network locked into
an incorrect module (Module 3, which represents Sequence II) by the
end of the 5th VT, and as a consequence there are a significant number of recall
errors during the subsequent 6 MTs. In comparison, after extensive training
(by block 132, as shown in Figure 7), Module 1 specialized in Sequence I and the
recall was accurate during all of the MTs.

Fig. 5. Results of Blocks 114-132 are shown. Legends as in Figure 2.

4.3 Discussion of Results 

Thus the learning of sequence specificity and stable switching among learned se- 
quences is achieved by this architecture. Each module learns the sequence tran- 
sitions appropriate to the sequence that the module specializes in. For example, 
Module 1 learns the transitions 1-to-2 and 2-to-3. In the current formulation,
sequence specificity in the modules arises as a result of learning the appropriate 
transitions. 

Although sequence and transition specificity are demonstrated, detailed com- 
parison between the results of monkey experiments and the simulations is left 
as future work. 

5 Summary 

Here we put forward some preliminary hypotheses about the brain areas whose
functionality is captured in the MFM architecture (also indicated in Figure 1):
the Sequence Modules mimic the function of the SMA, and the Responsibility
Estimation may take place in the preSMA. One of the roles attributed to the basal
ganglia is context-dependent response selection (Houk et al. 5), and this aspect
is attributed to the Response Selection module in the proposed architecture.

Fig. 6. Detailed results of the 38th block (Sequence ID I: 123) are shown.



In the current simulations, complex sequence learning, where the individual 
elements of a sequence may be repeating, is not demonstrated. A more general 
approach would be to use recurrent networks for sequence modules in place of 
purely feed-forward networks. Simulations of the general version are in progress. 

The notion of modularity and its usefulness in solving complex problems by 
decomposing them into sub problems has been addressed in several previous stu- 
dies (for example, Jacobs et al. 6; Sun & Peterson 12). Hierarchical organization 
of the modules is also studied (for example, Jordan & Jacobs 7; Parr & Russell 
8; Precup & Sutton 9; Tani & Nolfi 13). The architecture studied here also falls 
in the general framework of the previous studies. However, we pursue the idea
of a hierarchical mixture of experts with the goal of studying such mixtures as
functional models of sequential planning in biological systems. Although the
current architecture consists of only one level, the concept can be extended to
multiple levels of hierarchy. In such a hierarchical system, module switching at
any level will be affected not only by the responsibility estimations at the
current level but also by the values at the other levels. In one such formulation,
a network can be set up such that the sequential context encoded at each level
increases as one moves up the hierarchy. Hierarchical networks are of immense
use in the domains of natural language processing and sequential planning.

Fig. 7. Detailed Results of 132nd block are shown. Panel title: Block 132 (Seq ID 1 : 123).

Appendix A: Equations

A.1 Forward Sequence Model Output

$$y_j^k(t+1) = f\left( \sum_{i=0}^{4} w_{ij}^k\, x_i(t) \right)$$

where, j = 1, 2, 3 (Action Outputs)
k = 1, 2, 3, 4 (Modules / Sequences)
i = 0, 1, 2, 3, 4 (Inputs)

$$x_i(0) = \begin{cases} 1 & i = 0 \text{ (Bias node)} \\ 0 & i = 1, 2, 3 \\ 1 & i = 4 \text{ (Ready node)} \end{cases}$$

$$x_i(t) = \begin{cases} 1 & i = 0 \\ \{0, 1\} & i = 1, 2, 3 \text{ (Previous Response)} \\ 0 & i = 4 \end{cases}$$

$y_j^k(t+1) \in (0, 1)$ (Module Output)

$$f(x) = \frac{1}{1 + e^{-x}} \quad \text{(Sigmoid Function)}$$

A.2 Response Selection






1 j = m 
0 j ^ m 



where, 

TO = maxj (y*{t) + ; 






f Desired Target in VT 

in MX 

j = 1,2,3 



A.3 Responsibility Estimation
A.3.1 Prediction Error

error^it) = {yf{t) - 

**/,3 ^ f yj(t) if Correct Trial 
^ 1 ^ ~ Vj (^) Error Trial 



3 



A.3.2 Responsibility

p^{t) = e 

A‘(t) = nLoxht-^) 

E*.,nLoP’((-«) 

1 

cr(t+ 1) = aa{t) + (1 - a) — E min [E^S)] 

^ 5=0 

where, p^{t) = ^; t < 0, fc = 1, 2, 3, 4 
cr(0) = 3 

Note 1: The scaling factor keeps a running average of the best prediction 
error and thus ensures that the responsibility is not locked onto one module 
prematurely. 

Note 2: p and σ values are reset at the beginning of every block of 11 trials
(6 visual and 5 memory), whereas all other variables are reset at the beginning
of every trial.

A.4 Weight Update in the Forward Sequence Model

w%{t + 1) = w%{t) + ^X^error';{t)f^{t){l - - 1) 



$w_{ij}^k(0) \in (0, 1)$ (Uniformly distributed random values)

1 Introduction

Time is an important dimension in many real-world problems. This is particu- 
larly true for behavioral tasks where the temporal factor is critical. Consider 
for example the analysis of a perceptual scene or the organization of behavior 
in a planning task. Temporal problems are often solved using temporal techni- 
ques like Markovian Models or Dynamic Time Warping. Classical connectionist 
models are powerful for pattern matching tasks but exhibit some weaknesses in 
dealing with dynamic tasks involving the temporal dimension. Thus, they are 
efficient for off-line statistical data processing, but must be adapted for situated 
tasks which are intrinsically temporal. 

This adaptation can correspond to the coupling of connectionist models with 
classical temporal techniques, thus yielding hybrid models (Sun and Alexandre 
1997). It can also correspond to the design of new connectionist architectures 
with specific abilities for temporal processing. We have proposed in the past a 
classification for temporal connectionist architectures (Durand and Alexandre 
1996). Following this classification, these networks can be described according 
to the way time is represented within the architecture. 

Architectures with an external representation of time correspond to classical 
neural networks like multilayer perceptrons in which only the input space is 
modified to include the temporal dimension and some specific mechanisms are 
added. The Time Delay Neural Network (TDNN) is a typical example of this 
kind of system. This model was introduced by Waibel et al. (1989) to learn
and recognize phonemes in automatic speech recognition tasks. The input space 
is a window on a spectral representation of speech with frequency and time 
axes, and a mechanism of weight sharing ensures invariance of position in time. 
Apart from that, the general architecture of a TDNN is a classical multilayer 
perceptron learning with a backpropagation algorithm. Models with such an 
external representation suffer from serious limitations because time is considered 
as a dimension similar to other spatial or physical dimensions. 

Architectures with an internal representation of time are specific architectures
designed for the purpose of dynamic information processing and are thus not
classical architectures with an adapted input space. The internal representation
of time in these architectures can be built either in an implicit or an explicit
way.

Concerning the implicit method, typical architectures are recurrent neural 
networks (Pearlmutter 1990), that manage time via the succession of their in- 
ternal steady states. In this case, the current output is obtained as a function of 
the current input state and of the context, obtained from a copy of the previous 
internal state of the network (Elman 1990). This representation is called impli- 
cit because the succession of events only appears implicitly, through the context 
layer. 

On the contrary, the explicit method for the internal representation of time 
clearly implements time through sequences of events that can be detected direc- 
tly inside the network. Close to Markovian principles, these architectures include 
units that represent events, and links that represent transition probabilities bet- 
ween these events, thus yielding explicit sequence representations within the 
network. Models using this strategy of representation are often (but not always 
(Beroule 1990)) inspired by cortical functioning, either for memory mechanisms 
(Ans et al 1994) or for more general sensorimotor tasks (Alexandre 1996). Here 
the main problem is to design learning rules for establishing these temporal links 
and to integrate the latter with other classical spatial links. 

The biological validity of these various strategies for time representation inside
neural networks has been discussed elsewhere (Durand and Alexandre 1996).
We just mention here that the external strategy could correspond to the processing
of temporal sensory input like auditory input (Suga 1990), the internal
implicit strategy could correspond to the highest levels of time integration
(Dominey et al. 1995), (Elman 1990) and the internal explicit strategy could
correspond to some biological models of cortical functioning (Burnod 1989). Many
neurobiologically plausible mechanisms of time processing have been proposed
in the past. They generally use an internal representation of time. Some
of the most typical will be reviewed in section 2.

Beyond this local view on specific temporal mechanisms, the goal of this 
chapter is to emphasize the idea of integrating these mechanisms into a more ge- 
neral framework, allowing a better exploitation and articulation between them. 
As will be discussed in section 3, the cerebral cortex may be seen as a set of 
functionally and architecturally different modules, each devoted to specific as- 
pects of time processing. Building modular architectures gathering and coupling 
these abilities can undoubtedly offer new temporal properties, particularly if 
integrated behavioral tasks are investigated, as reported in section 4. 

2 Temporal Mechanisms in Biological Modeling 

Even if the whole neural network domain often draws (more or less tightly) on
biological inspiration, mechanisms like activation functions or learning rules are
often designed with no reference to time, whereas a real neuron is a dynamic
system that evolves over time. In this section, some biologically inspired temporal
models of neurons available in the literature will be presented. All of them have
been designed in order to model experimental data from conditioning paradigms
involving stimulus associations, sequence management, or the ability to keep
cues in memory during a delay period. The underlying temporal mechanisms will be
described here and related to the functioning and the learning of neurons.
The experimental framework and the fitting of the models to the data will not
be discussed; the reader may refer to the cited papers for further information
concerning those points.

2.1 Functioning Mechanisms 

In simple terms, neuronal functioning can be explained as follows. Dendrites
receive signals from other neurons and transmit them to the soma and the axon.
Then, a non-linear process causes an output signal to be propagated along the
axon toward other connected neurons. This signal constitutes a spike train. Only
the number and the timing of spikes code the information. In classical neuronal
models, a continuous value, the activation, stands for the mean firing rate
and thus hides the temporal behavior of individual neurons. By contrast,
this temporal behavior is exploited by some more biologically inspired models.

Spiking neurons One of the lowest levels of description of neuronal temporal
behavior is the spike itself (Gerstner 1998). In the so-called spiking neuron
approach, neuronal activity is fully reported with, for each neuron i, the set
of its firing times:

$$\mathcal{F}_i = \{\, t_i^{(f)} \;;\; 1 \le f \le n \,\} \tag{1}$$

Based on these elementary data, various neuronal operations can be implemented
(Gerstner 1998). They can investigate rate coding (with an average over time,
over several cycles or over a population of neurons) or pulse coding (strategies
of coding based on synchronicity of spike timing).

Several models of neurons, from the simplest to the most complex (e.g.
compartmental), have also been designed within this formalism (Maass and Bishop
1998). For example, the Spike Response Model describes the neuronal state $u_i(t)$
at time t (which can be interpreted as the cell membrane potential) as the sum
of the neuronal response to its own spikes (refractory period with negative kernel
$\eta_i(s)$) and the response to presynaptic spikes (excitatory (resp. inhibitory)
postsynaptic potential given by a positive (resp. negative) kernel $\epsilon_{ij}(s)$),
as written in equation 2.

$$u_i(t) = \sum_{t_i^{(f)} \in \mathcal{F}_i} \eta_i\left(t - t_i^{(f)}\right) + \sum_{j \in \Gamma_i} \sum_{t_j^{(f)} \in \mathcal{F}_j} w_{ij}\, \epsilon_{ij}\left(t - t_j^{(f)}\right) \tag{2}$$

where $\Gamma_i$ is the set of neurons connected to neuron i and $w_{ij}$ corresponds to the
synaptic strength of the connection between the neurons i and j.
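
As a rough illustration of equation 2, the sketch below evaluates the membrane potential of one neuron from lists of firing times. The exponential kernel shapes, the time constants, and all function names are assumptions made for illustration and are not taken from (Gerstner 1998); for brevity, inhibition is represented here by a negative weight rather than by a negative kernel.

```python
import math

def eta(s, eta0=1.0, tau=4.0):
    """Assumed refractory kernel: negative and decaying after an own spike."""
    return -eta0 * math.exp(-s / tau) if s > 0 else 0.0

def epsilon(s, tau=4.0):
    """Assumed postsynaptic kernel: positive and decaying after a presynaptic spike."""
    return math.exp(-s / tau) if s > 0 else 0.0

def membrane_potential(t, own_spikes, presyn_spikes, weights):
    """u_i(t): refractory responses to own spikes plus weighted postsynaptic responses."""
    u = sum(eta(t - tf) for tf in own_spikes)
    for j, spikes in presyn_spikes.items():
        u += sum(weights[j] * epsilon(t - tf) for tf in spikes)
    return u

# Hypothetical example: one excitatory and one inhibitory presynaptic neuron.
presyn = {1: [4.0], 2: [4.5]}
w = {1: 0.8, 2: -0.3}
u_rest = membrane_potential(5.0, [], {}, {})              # no spikes at all
u_driven = membrane_potential(5.0, [], presyn, w)         # presynaptic input only
u_refractory = membrane_potential(5.0, [4.9], presyn, w)  # neuron has just fired
```

With these assumed kernels, a recent spike of the neuron itself lowers its potential, which is the refractory effect described in the text.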

This formalism has the advantage of being established at a very low (and thus
precise) level of time. It has also given rise to a variety of theoretical as well as
applied studies (Maass and Bishop 1998), which now establish spiking neurons as
a research domain in its own right. However, the corresponding drawback of this
formalism is that its fine granularity implies a large amount of computation even
for a simple network of neurons. Another drawback is the lack of well-established
learning rules at this level of description. That is one reason why it is
important to keep in mind other formalisms describing these (and other related)
temporal mechanisms.



The leaky integrator At a higher level of description than the spike, explicit 
temporal functions can also be used to obtain a neuronal temporal behavior. 
In the simple leaky integrator model (SLI), the input to a neuron at time t,
denoted by I(t), can take a continuous value between zero and one (Reiss and
Taylor 1991). Then, the membrane potential A(t) is written as:

$$A(t+1) = f(I)\, I(t) + (1 - f(I))\, A(t) \tag{3}$$

where the function f(I) is defined, with constants a and d, as:

$$f(I) = d\,(1 - I) + a\,I \tag{4}$$

Finally, the output of the neuron is computed via the Heaviside function H as:

$$Out(t+1) = H(A(t) - 0.5) \tag{5}$$

As illustrated in figure 1, functions A and f are such that they give the
neuronal internal state a wave input attack in the time 1/ln(a) and a wave
input decay in the time 1/ln(d). It is thus possible to determine the shape of the
activity with the choice of constants a and d. Of course, these phenomena can
also be obtained within the spiking neuron formalism (Gerstner 1998), but at a
much higher computational cost.
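
The following minimal sketch iterates equations 3 to 5 on a short input pulse. The values a = 0.5 and d = 0.1, the pulse length, and the function names are hypothetical choices for illustration; the convention H(0) = 0 is also an assumption.

```python
def sli_step(A, I, a, d):
    """One update of equation 3, with f(I) computed as in equation 4."""
    f = d * (1.0 - I) + a * I
    return f * I + (1.0 - f) * A

def simulate_sli(inputs, a=0.5, d=0.1, A0=0.0):
    """Return the potentials A(t+1) and the outputs Out(t+1) for an input sequence."""
    A = A0
    potentials, outputs = [], []
    for I in inputs:
        outputs.append(1 if A > 0.5 else 0)  # equation 5, taking H(0) = 0
        A = sli_step(A, I, a, d)
        potentials.append(A)
    return potentials, outputs

# Hypothetical stimulus: 5 time steps of input followed by 20 steps of silence.
inputs = [1.0] * 5 + [0.0] * 20
potentials, outputs = simulate_sli(inputs)
```

With these values the potential rises quickly while the input is present (rate a) and decays slowly afterwards (rate d), so the neuron stays active for more time steps than the input itself lasted.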

This trace mechanism, describing neurons as leaky integrators, makes the 
neuronal activity last longer than its input. We will explain below how this 
mechanism is important for a network to properly learn and recall sequences. 



The gated dipole Synaptic functioning has also been studied, with differential 
equations over time describing at each time the variation of some state variables, 
as in the SLI model. This approach has led to many models, sharing common 
features. These features are described below, on the basis of the gated dipole mo- 
del introduced by Grossberg (Grossberg 1984). The model is a good illustration 
of the kind of complex temporal behavior that can arise from straightforward 
equations. 

Fig. 1. This figure represents the typical shape of activity obtained with a trace
mechanism. It also illustrates how this mechanism enables two events occurring at
distinct instants to have lasting activities, to meet and to interact via learning.

The gated dipole is grounded on the modeling of synaptic dynamics. The role
of the synapse is to transfer a signal from presynaptic to postsynaptic nerve
fibers. Let t be the (continuous) time parameter, S(t) the value of the presynaptic
activity, T(t) the value of the postsynaptic activity and z(t) the conductance of
the synapse provided by chemical neurotransmitters. Equation 6 describes the
way S is transmitted through the synapse over time, and figure 2 illustrates it.

$$T(t) = S(t)\, z(t) \tag{6}$$



A perfect conduction would correspond to a constant conductance z(t) over
time. When S(t) occurs (becomes non-null), some previously stored
neurotransmitters are emitted to provide conductance (this is considered as
instantaneous in the model). The synapse continuously produces neurotransmitters
to “refill” the stock until saturation. The production and consumption of
neurotransmitters can then be described with equation 7.

$$\frac{d}{dt} z(t) = A\,(B - z(t)) - S(t)\, z(t) \tag{7}$$

The first term of equation 7 represents the production of neurotransmitters,
at a rate A, up to the level B. The second term is the consumption of
neurotransmitters by the signal S(t). If the signal S(t) is kept constant, production
and consumption of neurotransmitters balance each other, and the output
signal T(t) reaches an equilibrium value such that:

$$\frac{d}{dt} z(t) = 0 = A\,(B - z(t)) - S(t)\, z(t) \;\Rightarrow\; z(t) = \frac{AB}{A + S(t)} \;\Rightarrow\; T(t) = \frac{A B\, S(t)}{A + S(t)} \tag{8}$$



The function T = f(S) at the equilibrium state is non-linear, monotonically
increasing and saturates at the value AB. Such a function is similar to the sigmoidal
transfer function used with the classical formal neuron. Outside the equilibrium
case, the dynamics of the synapse, shown in figure 2, have interesting temporal
properties. For example, overshoots and undershoots can trigger events when
S(t) respectively sets and resets. Moreover, the decay from overshoot to
habituation can be considered as a progressively decaying trace of the burst of S. The
concept of trace is useful to correlate time-separated events, as will be discussed
later.

This model is suitable for detecting transitions of signals and illustrates well 
the kind of temporal properties easily obtained by using appropriate intrinsically 
temporal models. 
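
The overshoot followed by habituation can be reproduced with a simple Euler integration of equation 7 for a single gated synapse. This is only a sketch: the parameter values (A = 0.1, B = 1), the step size, the step-shaped signal, and the assumption that the transmitter stock is full before the signal arrives are all illustrative choices, and only one channel of the dipole is simulated.

```python
def simulate_synapse(S, A=0.1, B=1.0, dt=0.01):
    """Euler integration of equation 7; returns T(t) = S(t) z(t) at each step."""
    z = B  # transmitter stock assumed full before any signal arrives
    T = []
    for s in S:
        T.append(s * z)                    # equation 6
        z += dt * (A * (B - z) - s * z)    # equation 7
    return T

# Hypothetical step signal: silent for 5 time units, then constant at 1.
S = [0.0] * 500 + [1.0] * 2000
T = simulate_synapse(S)
onset_value = T[500]                            # overshoot at signal onset
final_value = T[-1]                             # habituated level
equilibrium = 0.1 * 1.0 * 1.0 / (0.1 + 1.0)     # A B S / (A + S), equation 8
```

At onset the transmitted signal equals S B, then decays toward the equilibrium value of equation 8 as the transmitter is depleted.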




Fig. 2. The gated dipole, from (Grossberg 1984). 



2.2 Learning Mechanisms 

Classical static learning mechanisms are often based on Hebbian rules, which 
compute weight variations from the correlation between presynaptic and post- 
synaptic activities, according to the rule: 

$$\Delta W_{ij} = \alpha\, S_i\, T_j \tag{9}$$

This static view is obviously an approximation: postsynaptic activity that is a
consequence of presynaptic activity is not simultaneous with it but follows it.
Several temporal learning mechanisms have been proposed that take this
constraint into account.

The common feature that underlies these mechanisms is the use of traces, so
that instantaneous correlations between traces of signals signify temporal
correlations of those signals (cf. figure 1). The shapes of the traces, and the properties
of the learning rules, endow the models with their own specific properties.



Learning the earliest predictor Adaptive behavior has been widely studied
through classical (Pavlov 1927) and instrumental (Skinner 1938) conditioning.
The classical conditioning experiment is described in the following terms
in (Sutton and Barto 1981):

..., the subject is repeatedly presented with a neutral conditioned sti- 
mulus, that is, a stimulus that does not cause a response other than ori- 
enting responses, followed by an unconditioned stimulus (UCS), which 
reflexively causes an unconditioned response (UCR). After a number of 
such pairings of the CS and the UCS-UCR, the CS comes to elicit a res- 
ponse of its own, the conditioned response (CR), which closely resembles 
the UCR or some part of it. 

There exist several variations of this simple experiment. For example, one can
try to condition the subject with the help of several $CS_i$. The point is that the
delay separating $CS_i$ from UCS (inter-stimulus interval or ISI) plays a crucial
role in the success or failure of the conditioning. Neural modeling takes this point
into account. Furthermore, a large number of experimental paradigms
(overshadowing, blocking, etc.) give serious clues concerning the temporal mechanisms
underlying conditioning.

Sutton and Barto (1981) propose a model based on activity
traces which is able to take these kinds of data into account. The model is
grounded in a formal neuron (cf. figure 3) where input $x_0$ is the value of the
UCS, the other inputs $x_i$ are the respective values of the $CS_i$, and y is either
the CR or the UCR.

The model is driven by the following equations, where f is a sigmoid function,
$\alpha$ and $\beta$ are positive constants with $0 < \alpha, \beta < 1$ and c is a positive constant
determining the learning rate:

$$\bar{x}_i(t+1) = \alpha\, \bar{x}_i(t) + x_i(t) \tag{10}$$

$$\bar{y}(t+1) = \beta\, \bar{y}(t) + (1 - \beta)\, y(t) \tag{11}$$

$$y(t) = f\left( \sum_{j=0}^{n} w_j(t)\, x_j(t) \right) \tag{12}$$

$$\forall i \in [1..n], \quad w_i(t+1) = w_i(t) + c\,[y(t) - \bar{y}(t)]\, \bar{x}_i(t) \tag{13}$$

Each input $x_i$ and the output y has a trace activity, respectively $\bar{x}_i$ and $\bar{y}$. At the
beginning of conditioning, in the case of a single CS (corresponding to a single input
x), the output y is activated at the same time as UCS (or $x_0$) because of the
fixed weight $w_0$. Meanwhile, if the trace $\bar{x}$ is active, the weight w increases, since
its variation is proportional to $(y(t) - \bar{y}(t))$ and $\bar{x}$. The CS will then soon be able
to trigger the output on its own. The adaptive element has learned to trigger a response
whenever CS is present just before UCS. As in the classical conditioning
paradigm, the delay separating the onset of CS and UCS plays a crucial role since
learning only occurs in case of temporal overlap between a positive trace of x
and a “burst” of $(y(t) - \bar{y}(t))$.

Fig. 3. The adaptive element of (Sutton and Barto 1981).
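
A minimal simulation of equations 10 to 13 for a single CS is sketched below. Several choices are assumptions made for illustration only: a clipping function stands in for the sigmoid f so that the resting output is zero, the traces are reset at the start of each trial, and the stimulus timings, the constants, and the function names are hypothetical.

```python
def squash(s):
    """Assumed stand-in for the sigmoid f: clip the weighted sum to [0, 1]."""
    return max(0.0, min(1.0, s))

def run_trial(w1, alpha=0.5, beta=0.5, c=0.05, w0=1.0, steps=30):
    """One CS-UCS pairing; returns the updated CS weight and the output sequence."""
    x_bar, y_bar = 0.0, 0.0                      # traces reset between trials (assumption)
    outputs = []
    for t in range(steps):
        x0 = 1.0 if 10 <= t < 15 else 0.0        # UCS
        x1 = 1.0 if 5 <= t < 10 else 0.0         # CS, presented just before the UCS
        y = squash(w0 * x0 + w1 * x1)            # equation 12
        w1 += c * (y - y_bar) * x_bar            # equation 13
        x_bar = alpha * x_bar + x1               # equation 10
        y_bar = beta * y_bar + (1.0 - beta) * y  # equation 11
        outputs.append(y)
    return w1, outputs

w1 = 0.0
history = []
for _ in range(5):                               # five conditioning trials
    w1, _ = run_trial(w1)
    history.append(w1)
_, naive = run_trial(0.0)                        # response of an untrained element
_, probe = run_trial(w1)                         # response after conditioning
```

In this sketch the CS weight grows on every pairing because the trace of the CS is still positive when the UCS makes the output jump, so after a few trials the CS alone elicits a response at its onset.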



Fig. 4. Time courses of element variables after conditioning. CS elicits a response of
its own. Traces shown: UCS; CS (x) and its trace; CR, UCR (y); the ISI is marked.



Another aspect of classical conditioning is the context factor, which may be
decisive during the learning phase. For example, when a first association
$CS_1 \rightarrow CR$ has been learned, the addition of a new $CS_2$ contingent to $CS_1$
generally implies a very weak $CS_2 \rightarrow CR$ association or no association at all.
This phenomenon is called “blocking” and is modeled by the Rescorla-Wagner
theory (Rescorla and Wagner 1972), which states that organisms only learn when
events violate their expectations. Figure 5 shows the evolution of weights for such
a blocking paradigm, where three phases can be distinguished:

1. This is a classical conditioning experiment with only one $CS_1$. Weight $w_1$ is
then increased up to its asymptotic value.

2. A second $CS_2$ contingent to $CS_1$ is added, but, since learning has occurred
in the previous phase, output (triggered by $CS_1$) onset now overlaps with
$CS_1$ onset. $CS_2$ is then unable to learn anything since its trace activity $\bar{x}_2$ is
not positive during the output onset. Weights $w_1$ and $w_2$ remain the same.

3. $CS_2$ onset is now earlier than $CS_1$ onset, so trace activity $\bar{x}_2$ is positive during
the output onset (triggered by $CS_1$), and weight $w_2$ is increased. The
consequence is that output will be triggered sooner and sooner by $CS_2$, up to
the point where $\bar{x}_1$ will be positive during the output offset. Then $w_1$ will
be decreased to zero. $CS_2$ has become a better predictor of UCS than $CS_1$.

Fig. 5. The blocking paradigm (vertical axis: connection weight). As long as $CS_2$ is
not the earliest predictor (steps 1 and 2), $w_2$ remains unchanged. As soon as $CS_2$
becomes the earliest predictor (step 3), $w_1$ decreases to zero while $w_2$ reaches its
asymptotic value.
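
The idea that learning is driven by violated expectations can be sketched with the usual trial-level form of the Rescorla-Wagner rule, in which each present stimulus changes its associative strength in proportion to the difference between the UCS level and the summed prediction. This is the standard textbook formulation rather than the trace-based element discussed above, and the learning rate, trial counts, and names are hypothetical.

```python
def rescorla_wagner(trials, n_stimuli=2, rate=0.2, lam=1.0, V=None):
    """Apply the Rescorla-Wagner update over a list of trials.

    Each trial is a tuple of 0/1 flags saying which CS are present;
    the UCS (asymptote lam) is assumed present on every trial.
    """
    V = [0.0] * n_stimuli if V is None else list(V)
    for present in trials:
        # Prediction error: how much the UCS violates the current expectation.
        error = lam - sum(v * p for v, p in zip(V, present))
        for i, p in enumerate(present):
            V[i] += rate * p * error
    return V

# Blocking: CS1 alone first, then CS1 and CS2 together.
V_phase1 = rescorla_wagner([(1, 0)] * 50)
V_blocked = rescorla_wagner([(1, 1)] * 50, V=V_phase1)
# Control: CS1 and CS2 presented together from the start.
V_control = rescorla_wagner([(1, 1)] * 50)
```

After pretraining, CS1 already predicts the UCS almost perfectly, so the error is close to zero and CS2 acquires almost no strength, whereas in the control condition both stimuli share the association.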

Learning the date of the predictor Learning the earliest predictor is useful
for anticipating the consequence of an event. However, if the time interval that
separates the event and its consequence is nearly constant, it may be useful to
be ready only when the consequence occurs. A mechanism allowing the
memorization of this time interval has been proposed in (Grossberg and Schmajuk
1987), (Grossberg and Schmajuk 1989). This mechanism is grounded on a trace
that arises after the burst of a signal and then shuts down. This defines a time
interval, after the burst of the signal, and correlation can be computed during
this interval. The idea is then to provide many synapses with different time
constants, and thus to allow an overlap during complementary intervals. The model
is called the Spectral Timing Model.

Let us first describe the functioning of one synapse, and then the use of a
range of such synapses for anticipation. Let $I_{CS}$ be a conditioned signal (a step
function), and $I_{UCS}$ be the unconditioned signal. When learning has occurred,
$I_{UCS}$ has to be anticipated when it should occur, according to $I_{CS}$. Synapse i is
described by three variables, according to the following equations, where f is a
sigmoidal function.

$$\frac{d}{dt} x_i = \alpha_i \left[ -A x_i + (1 - B x_i)\, I_{CS} \right] \tag{14}$$

$$\frac{d}{dt} y_i = C\,(1 - y_i) - D f(x_i)\, y_i \tag{15}$$

$$\frac{d}{dt} z_i = E f(x_i)\, y_i \left[ -z_i + I_{UCS} \right] \tag{16}$$

Equation 14 allows $x_i$ to transmit $I_{CS}$ with a delay, depending on parameters.
Equation 15 describes spontaneous production of neurotransmitter $y_i$ and
consumption of this neurotransmitter when the synapse transmits $x_i$. This
equation is similar to equation 7. As a delayed signal $x_i$ is transmitted through the
synapse, using neurotransmitter $y_i$, the transmitted value of $x_i$, i.e. the product
$f(x_i)\, y_i$, is a trace of the occurrence of $I_{CS}$. Then, equation 16, which describes the
strength of association $z_i$, drives $z_i$ toward $I_{UCS}$, but only
when the trace of $I_{CS}$ is strong enough. The shape of the traces for a high $\alpha_i$
(solid line) and for a low $\alpha_i$ (dashed line) is illustrated in figure 6.

Fig. 6. Spectral Timing Model, from (Grossberg and Schmajuk 1987).



Spectral timing then consists in endowing a neuron with a battery of synapses
having different values $\alpha_i$, for correlations with a given $I_{CS}$ (cf. figure 6).
The response of the neuron is then given by equation 17.

$$R = \left[ \sum_i f(x_i)\, y_i\, z_i - F \right]^+ \quad \text{with} \quad [x]^+ = \begin{cases} x & \text{if } x > 0 \\ 0 & \text{otherwise} \end{cases} \tag{17}$$



The behavior of the neuron then consists in responding to an $I_{CS}$ by summing
traces that have been associated with $I_{UCS}$ (they have a high $z_i$ value). The
response then occurs at the moment when $I_{UCS}$ should occur.
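
To make the role of the different time constants concrete, the sketch below integrates the activation and transmitter equations of one synapse (equations 14 and 15) with the Euler method and records the gated trace, i.e. the product of f(x_i) and y_i. The form of the sigmoidal function f, every parameter value, the initial conditions, and the function names are assumptions chosen for illustration; the association strength of equation 16 is not simulated.

```python
def f(x, theta=0.25, power=4):
    """Assumed sigmoidal signal function (its exact form is not given in the text)."""
    return x**power / (theta**power + x**power)

def gated_trace(alpha, A=1.0, B=1.0, C=0.01, D=1.0, dt=0.01, steps=4000):
    """Euler integration of equations 14 and 15 for a step I_CS = 1.

    Returns the trace f(x_i) * y_i sampled at every time step.
    """
    x, y = 0.0, 1.0          # no activation, full transmitter stock (assumption)
    I_cs = 1.0
    trace = []
    for _ in range(steps):
        trace.append(f(x) * y)
        dx = alpha * (-A * x + (1.0 - B * x) * I_cs)   # equation 14
        dy = C * (1.0 - y) - D * f(x) * y              # equation 15
        x += dt * dx
        y += dt * dy
    return trace

def peak_time(trace, dt=0.01):
    """Time at which the trace reaches its maximum."""
    return dt * max(range(len(trace)), key=trace.__getitem__)

fast = gated_trace(alpha=4.0)    # high alpha_i: early, sharp trace
slow = gated_trace(alpha=0.25)   # low alpha_i: later trace
```

Under these assumptions a synapse with a high rate peaks shortly after the onset of the conditioned signal, while a synapse with a low rate peaks later, so a battery of such synapses covers a range of delays.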

Finally, let us mention that the spectral timing paradigm has also been suc- 
cessfully implemented with a range of leaky integrators as a model of frontal 
cortex (Dominey et al. 1995). 



Context-dependent learning The models previously discussed were groun- 
ded on classical conditioning experiments, and deal with the association of two 
signals. Some other models describe a synapse concerned with a third signal, 
modulating the associative role of the synapse. This defines the concept of the 
synaptic triad. 



Fig. 7. The synaptic triad, from (Dehaene and Changeux 1989).



The synaptic triad model of Dehaene and Changeux (1989),
illustrated in figure 7, involves a presynaptic activity $S_a(t)$, a postsynaptic
activity $S_p(t)$, and a modulation signal $S_m(t)$ that acts on the synaptic weight
W(t). The weight W(t) is a trace of the modulation activity.

$$W(t+1) = \begin{cases} \alpha_p W(t) + (1 - \alpha_p)\, W^m(t) & \text{if } S_m(t) \ge 0.5 \\ \alpha_d W(t) & \text{if } S_m(t) < 0.5 \end{cases} \tag{18}$$

The $\alpha_p$ and $\alpha_d$ parameters are in the range [0, 1]. The weight W(t) is a trace
that increases toward the current maximal authorized weight $W^m(t)$ when the
modulation signal $S_m(t)$ is high, and decreases otherwise. The contribution of
all the synapses i to the output signal $S_p(t)$ is given by:

$$S_p(t+1) = \sum_i W^i(t)\, S_a^i(t) \tag{19}$$



The temporal learning rule for the triad actually computes the value of the 
synapse. The learning rule is given by: 



6W"^{t) = (3R{t) 






(20)

The function R(t) is a reinforcement signal, related to the experimental
framework of the model (Dehaene and Changeux 1989). It will be considered as a
positive constant and will not be discussed here. The maximal weight $W^m$
increases according to the co-occurrence of presynaptic signal $S_a$ and postsynaptic
signal $S_p$, as for Hebbian learning rules, but this occurs only when the trace W
of $S_m$ is close to its maximum value. The signal $S_m$ can be viewed as a
context value, and the synapse will conduct signals only when the context is active
(cf. equation 19).
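
The gating role of the context signal can be sketched as follows. The text states that the weight W rises toward the maximal weight when the modulation signal is high and decays otherwise; the specific update used here (an exponential approach to the maximum and an exponential decay), the threshold of 0.5, the parameter values, and the function names are assumptions made for illustration, and the learning of the maximal weight itself is not simulated.

```python
def update_weight(W, W_max, S_m, alpha_p=0.8, alpha_d=0.9):
    """Trace dynamics of the effective weight W under the modulator S_m."""
    if S_m >= 0.5:
        return alpha_p * W + (1.0 - alpha_p) * W_max   # rise toward W_max
    return alpha_d * W                                  # decay otherwise

def postsynaptic(weights, presyn):
    """Equation 19: weighted sum of the presynaptic activities."""
    return sum(W * s for W, s in zip(weights, presyn))

W, W_max = 0.0, 1.0
for _ in range(30):                    # context (modulator) active
    W = update_weight(W, W_max, S_m=1.0)
W_context_on = W
out_on = postsynaptic([W], [1.0])

for _ in range(50):                    # context inactive
    W = update_weight(W, W_max, S_m=0.0)
W_context_off = W
out_off = postsynaptic([W], [1.0])
```

The same presynaptic activity is transmitted almost fully while the context is active and almost not at all once the context has been off for a while.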

More complex contextual rules have been developed since the initial model
by Dehaene and Changeux, dealing with non-simultaneity of presynaptic and
postsynaptic signals, i.e. one has to come before the other when the context is active
in order to increase the weight of the synapse (see (Guigon 1993) for example).

2.3 Biological Temporal Mechanisms 

This overview of some functioning and neuronal learning mechanisms illustrates 
the variety of temporal neuronal properties that can be exploited in artificial 
neural networks (and that are absent in classical static models). As mentioned 
above, these mechanisms are generally used to model experimental data and 
often stick to the experimental framework. They have to be integrated in a more
complete architecture in order to address a wider range of real-world problems.
For example, this has been successfully done for radar imaging by Grossberg 
and colleagues (Gove et al. 1995), (Grossberg et al. 1995). 

Engineering-like applications, like autonomous robot control, are often tackled
with techniques based on Markov Decision Processes, such as Q-Learning
(Watkins 1989) (see (Littman 1996) for a good overview of these techniques). In
the latter case, time is considered as a discrete parameter, ordering a series of
synchronous interactions between the agent and the external world. However,
this kind of approach seems to be seldom applied to real asynchronous problems.

Biological models may be an efficient way to overcome this difficulty, due 
to the robustness of temporal mechanisms like the trace concept. Our claim is 
that the integration within a cortical framework of such elementary biologically- 
inspired mechanisms can lead to efficient systems for real-world applications. 



3 Cortical Modeling 

The goal of establishing a framework for cortical modeling is to bridge the gap
between the isolated temporal mechanisms and the distributed and polymodal
nature of the cerebral cortex itself. This makes it possible to efficiently express
complex behaviors, including several temporal resolutions. Before proposing (in
the next section) our computational models for this framework, we report the
outline of the biological model that underlies it. The reader can refer to the
original description of the model (Burnod 1989) for more details about the
underlying biological data. These data yield two levels of description. First, at the
global level, architectural and functioning data give hints about different kinds
of processing regions in the cortex, and information pathways between them.
Second, at a more local level, neuronal circuitry is described together with the
corresponding functioning and learning rules.



3.1 The Global Level: A Network of Areas 

The global description of the four major lobes (frontal, parietal, temporal and
occipital), together with the functional description of the cortex proposed by
Brodmann as early as 1909 (Brodmann 1909), gives valuable hints for a functional
approach. It distinguishes four major types of areas, as illustrated in figure 8
and described below (using the formalism of this figure).




Fig. 8. Anatomical and functional view of the cerebral cortex: (a) main lobes of the
cerebral cortex; (b) model of the connectivity between cortical areas (adapted from
(Burnod 1989)).



Sensory areas These mono-modal areas map the sensory information coming
from the different body receptors, which can be divided into five main sensory
poles (auditory (A), visual (V), somesthesic (S), olfactory (O) and internal
molecular (SH)). Moreover, the information flow is structured in such a way that
topology is conserved from peripheral receptors to cortical sensory areas. A fixed
mono-modal perceptive sequence can be encoded at this level.

Motor areas (M) These mono-modal areas allow the performing of actions
upon the internal world (e.g. hormonal secretion) or the external world (e.g.
hand movement). Actions can be executed simultaneously but also coordinated
in a complex sequence of movements (but fixed at this level).



Posterior associative areas These polymodal areas, including temporal and 
parietal areas, are crucial to cortical organization since they allow the linking 
of at least two areas, one with the other. They will for instance allow direct 
sensorimotor coordination encoding in the case of a link between a sensory and 
a motor area (e.g. hand-eye coordination). Moreover, these associative areas may 
also link two sensory areas or two other associative areas, allowing in this way the 
construction of a more structured and integrated representation of information. 
Stereotypical sensorimotor sequences (e.g. reaching one’s mouth with one’s hand) 
can be learned in these areas. 



Prefrontal associative areas From a functional point of view, prefrontal as- 
sociative areas have to be distinguished from their posterior counterparts. The 
former are generally action-oriented and play a major role in temporal organiza- 
tion of behavior. Furthermore, the privileged relations with the posterior cortex 
(cf. fig. 8(b)) and the presence of specific temporal mechanisms within prefrontal 
units (bistable) make prefrontal areas able to construct and coordinate dyna- 
mic temporal sequences grounded on posterior ones (e.g. guidance of the hand 
toward a goal in the focus of attention). 



Information pathways To throw light on the nature of cortical organization, 
cortical modeling defines a framework for a distributed polymodal representa- 
tion of information. The integration of numerical data into more structured and 
“sub-symbolic” reference frames is possible throughout a hierarchy of polymo- 
dal areas. We are thus offered a way of keeping to some extent the robustness 
of numerical data while manipulating this data at a higher level. These cortical 
mechanisms involve (mainly in the posterior cortex) statistical and slow lear- 
ning, and perform a kind of extraction of the world regularities through the 
construction of stereotyped temporal sequences. The ability to model complex
and more dynamic behaviors requires additional mechanisms that can be provi- 
ded by frontal areas or even by extra-cortical structures. The cortical network 
is not fully interconnected but defines via associative areas three privileged pa- 
thways between the motor system, the internal state and the perception of the 
outside world in the following way: 

— parietal areas relate the outside world with the motor system 

— temporal areas relate the outside world with the internal state 

— frontal areas relate the internal state with the motor system 

Moreover, as shown in figure 8(b), there are privileged connections between 
frontal areas and posterior cortex: each posterior area is mirrored within the 
frontal lobes. This anatomical design suggests an interlaced cooperation between
perceptive posterior representations and frontal motor ones, for the temporal
organization of behaviour. Indeed, frontal areas are believed to play a central 
role in most complex temporal behaviors such as anticipation, planning, working 
memory or any other dynamic temporal sequencing behavior. 



3.2 The Local Level: Neuronal Assemblies 

A more detailed analysis of the inner organization of the cortical sheet may also 
describe this as a large set of elementary circuits: the cortical minicolumns. Each 
of those minicolumns receives a subset of the intra or extra-cortical information 
and because of the topological property of the cortical areas, neighboring
minicolumns will tend to receive the same subset of information. These groups of
minicolumns are called maxicolumns: they share the same information subset 
but are able to apply different filters on it. The model of the cortical column 
reported in (Burnod 1989) describes the functioning and learning properties of 
such maxicolumns, which are different from those of the formal neuron. 



Architecture The cortical minicolumn (also called cortical column) is a group
of perhaps a hundred interconnected neurons whose activity is essentially
related to the pyramidal neurons, while the other neurons, excitatory or inhibitory
interneurons, mainly participate in the inner mechanism of the column. Moreo- 
ver, the cortical column is a six layered structure (cf. figure 9) where layers I 
to III allow communication with other cortical columns while layers IV to VI 
allow communication with extra-cortical structures. Depending on the area the 
column belongs to, the size of each layer may vary greatly. 




Fig. 9. The six layered structure of the cortical column and the corresponding
input/output data channels (figure labels: afferent input; pyramidal neurons;
upper layer; lower layer)



Basic operations The inner mechanisms of the column result in three possible
distinct levels of activity for the pyramidal neurons:

— Inhibited level E0 represents very weak activity

— Low level E1 represents small variations of relatively low frequencies (5 Hz
to 10 Hz). We will refer to this priming state as the call state

— High level E2 represents a much higher frequency (50 Hz to 100 Hz). We will
refer to this state as the satisfaction state



Spatial filters As said before, a maxicolumn is a set of several neighboring cor- 
tical columns sharing the same subset of information and constitutes a functional 
module. As long as there is no specialization of these columns, they will often be 
activated in a moderate way for any pattern of information. Learning is then the 
ability for a minicolumn to become specialized on a precise pattern of informa- 
tion while others in the same maxicolumn are inhibited (coupling/uncoupling). 
This task corresponds to filtering or feature extraction. 



Spatio-temporal filters Cortical columns are able to activate themselves at 
three distinct levels. Activation of a column at level E2 requires the simultaneous 
activation of both cortical and thalamic inputs. Consequently, when a column 
A receives a cortical input alone or a thalamic input alone, it will not reach 
level E2 but rather level E1. This activity E1 is nonetheless propagated to all
neighboring columns. If among them, one (C) is excited via its thalamic input, 
it will reach level E2 and will produce an extra-cortical action (whatever the 
target). This action will then modify the thalamic context, which may now be 
propitious for the activation of column A to level E2. Column A will then learn 
to preferentially call column C since this latter is favorable to the excitation of 
the former (cf. figure 10). This mechanism may be seen to some extent as a goal 
directed search: the level E1 is a desired or calling state, while level E2 is the
satisfaction state. 

The temporal mechanism of spreading intra-cortical activation indeed allows 
the search for sequences that are able to satisfy the calling column. Learning 
will then consist in slowly orienting the call activity toward columns that help 
to reach the satisfaction state. 
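
The call and satisfaction dynamics described above can be illustrated with a
minimal Python sketch. It is not taken from (Burnod 1989): the names Level,
column_level and replay_figure_10 are hypothetical, inputs are reduced to
booleans, and we assume that a column receiving no input at all stays at the
inhibited level. The slow learning of the call direction is not modeled.

```python
from enum import IntEnum


class Level(IntEnum):
    E0 = 0  # inhibited state
    E1 = 1  # call (priming) state
    E2 = 2  # satisfaction state


def column_level(cortical_input: bool, thalamic_input: bool) -> Level:
    """Level E2 needs both inputs together; a single input only yields E1."""
    if cortical_input and thalamic_input:
        return Level.E2
    if cortical_input or thalamic_input:
        return Level.E1
    return Level.E0  # assumption: no input at all leaves the column at E0


def replay_figure_10():
    """Replay the four steps of the call/satisfaction scenario for A and C."""
    # 1) A receives cortical input alone and reaches the call state.
    a_before = column_level(cortical_input=True, thalamic_input=False)
    # 2-3) The call spreads to C as cortical input; C also has thalamic input.
    c = column_level(cortical_input=a_before >= Level.E1, thalamic_input=True)
    # 4) The action of C changes the thalamic context seen by A.
    a_after = column_level(cortical_input=True, thalamic_input=c == Level.E2)
    return a_before, c, a_after
```

In this sketch, replay_figure_10() returns the levels E1, E2 and E2: column A
first calls, column C is satisfied, and the resulting change of thalamic context
then lets column A reach the satisfaction state.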



Bistable units All the prefrontal area units share a common mechanism: the 
bistable mechanism. Bistable units possess two stable states: a resting state and 
a sustaining state. Both ON (at rest to sustained activity) and OFF (sustained 
activity to rest) transitions require an external activity (e.g. external stimuli A 
and B) to be performed (cf. figure 11). Thus, while cortical posterior columns 
are only able to organize sequences at one level (e.g. A-B-C), frontal ones are 
able to organize hierarchical sequences (e.g. (A-(B-C))). 
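
The two transitions of a bistable unit can be sketched as a small state machine.
This is only an illustration under simplifying assumptions: the class name
Bistable and its step method are hypothetical, stimuli are plain labels, and the
activity is reduced to a boolean (at rest or sustained).

```python
class Bistable:
    """Two stable states: at rest (False) and sustained activity (True)."""

    def __init__(self, on_stimulus, off_stimulus):
        self.on_stimulus = on_stimulus    # external stimulus for the ON transition
        self.off_stimulus = off_stimulus  # external stimulus for the OFF transition
        self.sustained = False

    def step(self, stimulus=None):
        """Apply one (possibly absent) stimulus and return the current state."""
        if not self.sustained and stimulus == self.on_stimulus:
            self.sustained = True         # ON: rest -> sustained activity
        elif self.sustained and stimulus == self.off_stimulus:
            self.sustained = False        # OFF: sustained activity -> rest
        return self.sustained


unit = Bistable(on_stimulus="A", off_stimulus="B")
states = [unit.step(s) for s in [None, "A", None, "C", "B", None]]
```

With this input, states is [False, True, True, True, False, False]: the activity
started by A is sustained through the absence of stimuli and through the
unrelated stimulus C, and only B brings the unit back to rest.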



Fig. 10. 1) Column A receives cortical input and activates itself to level “E1” 2)
Activity is spread to neighbouring columns 3) Thalamic context allows the activation
of column C to level “E2”, an action is performed that modifies the thalamic context
4) Thalamic context modification allows column A to reach activation level “E2”.
Learning will occur and column A will learn to preferentially call column C.
(Legend: column at level E0, inhibited state; column at level E1, calling state;
column at level E2, satisfaction state.)



Fig. 11. 1) Bistable A+B- is at rest 2) Detection of stimulus A produces the ON
transition on the bistable A+B- 3) The bistable activity is sustained 4) Detection of
stimulus B produces the OFF transition on the bistable A+B-



4 Cortical Temporal Mechanisms for Computer Science
Purposes

As computer scientists, we are specifically interested in efficient software design 
for real-world problems. We think that such an engineering oriented purpose can 
benefit from the study of biological data. Nevertheless, most biological models are 
not suitable for immediate software integration, but have rather been designed 
to explain experimental data. Classical connectionist models, like multi-layer 
perceptrons or Kohonen’s Self Organizing Maps can be viewed as the adaptations 
of biological models to software engineering constraints, adding mechanisms such 
as backpropagation with differentiable transfer functions, or explicit
winner-takes-all.

Concerning computation of temporal information, which is crucial when ad- 
dressing real problems, a biological framework also appears to be helpful (cf. sec- 
tion 3). As is shown by the attempts we have mentioned, adaptation of biological 
temporal models to information processing is not obvious, and mechanisms that 
arise from such attempts appear to be heterogeneous (cf. section 1). We propose 
that one way to design reusable algorithms is to refer as precisely as possible 
to a cortical framework, allowing us to deal with highly integrated architectu- 
res. Integrated temporal processing implies different levels of time computation 
that have to be consistent with one another. Within the same cortical paradigm,
presented in section 3, we first describe low level (small time constant)
mechanisms that are applied to temporal pattern recognition. We then consider 
causality detection, applied to the learning of higher level reactive abilities of an 
autonomous robot. Finally, we look at temporal mechanisms involved in neural 
planning with bistable units. 

4.1 Monomodal Sequences within a Map 

The Temporal Organization Map (TOM) architecture (Durand and Alexandre 1996)
has been proposed as a model of the auditory cortex and applied to speech
processing. At the lowest level, a simple cochlea model (Hartwich and Alexandre
1997) performs a spectral transformation of the speech signal. The processing
level is a map of super-units, each super-unit standing for a maxicolumn, as 
described in section 3.2. As a first temporal mechanism, each super-unit has a 
spatio-temporal receptive field. The spatial field corresponds to the integration of 
a set of contiguous fibers from the cochlea model. The temporal field corresponds 
to a leaky integration, controlled by a decay parameter. This short term memory 
allows the integration of activation within an interval of time. 
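
As an illustration of such a spatio-temporal receptive field, the following
Python sketch sums a band of contiguous fibers and passes the result through a
discrete-time leaky integrator. The exact equations of the TOM model are not
given here, so the update a(t) = decay * a(t-1) + input(t) is an assumption, and
the function and parameter names are hypothetical.

```python
def super_unit_trace(frames, first_fiber, last_fiber, decay):
    """Leaky integration of the summed activity of a band of fibers.

    frames: one list of fiber activities per time step.
    The spatial field covers indices first_fiber .. last_fiber - 1.
    decay: value in [0, 1); larger values give a longer short term memory.
    """
    activation = 0.0
    trace = []
    for frame in frames:
        spatial_input = sum(frame[first_fiber:last_fiber])  # spatial field
        activation = decay * activation + spatial_input     # temporal field
        trace.append(activation)
    return trace


frames = [[1, 0, 0], [0, 0, 0], [0, 0, 0], [0, 1, 5]]
trace = super_unit_trace(frames, first_fiber=0, last_fiber=2, decay=0.5)
```

Here trace is [1.0, 0.5, 0.25, 1.125]: the activity caused by the first frame is
still partly present when the last frame arrives, and the third fiber, which
lies outside the spatial field, is ignored.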

The second temporal mechanism is performed within the processing map. It 
corresponds to intra-map temporal links between super-units which can build 
explicit sequences of activation within the map. Robustness to sequence distor- 
tion (insertion or deletion) is obtained through the interaction of both temporal 
mechanisms. A division mechanism was also implemented and leads to specia- 
lizations within super-units in order to differentiate sequences passing through 
the same units. 

This model was applied to spoken digit recognition using some classical
benchmarks in speech processing (Durand and Alexandre 1996). It obtained
recognition performance similar to that of the best stochastic models, and a
better one once the temporal and spatial complexity of the algorithm is taken
into account.



4.2 Multimodal Causality Sequences for Reactive Integrated 
Behaviour 

The mechanism described here is related to internal computation within an 
artificial cortical map. The maxicolumns (see section 3.2) of the map are involved 
in the learning of temporal regularities, and are the basic units of the model. As 
such, a unit represents a population of synchronous cortical columns that receive 
the same information. Three kinds of activation have been defined, as described 
for a single cortical column (Burnod 1989). 

First, unit i stores an excitation activity $E_i^{exc}$ as soon as the perceptive
events it is associated with occur. Another associated recency signal $E_i^{rec}$ is
used as a trace of the occurrence of the event. It is initialized to 1 and linearly
decays over time. When it reaches 0, the stored intensity is reset to 0. The
use of two variables $E_i^{exc}$ and $E_i^{rec}$ prevents the confusion of strong old
events with weak recent ones.

Second, a call activity $E_i^{call}$ of unit i means that event $e_i$ is useful, and has to
occur. For simplification here, this activity will be considered as a boolean value
(1 or 0), whereas it is, in the model, a continuous value, representing the strength
of the request for $e_i$.

Third, when a specific event that was requested through the call activity has
occurred (excitation of a called unit), the unit is said to be satisfied.
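
The pair formed by the stored intensity and the recency signal can be sketched
as follows. This is a minimal illustration rather than the implementation of the
model: the class name EventTrace, its method names and the decay_step parameter
(the amount removed from the recency at each time step) are hypothetical.

```python
class EventTrace:
    """Stored intensity of an event plus a linearly decaying recency signal."""

    def __init__(self, decay_step):
        self.decay_step = decay_step  # recency lost per time step (assumption)
        self.intensity = 0.0          # stored excitation intensity
        self.recency = 0.0            # 1 at occurrence, then decays to 0

    def event(self, intensity):
        """The associated perceptive event occurs with the given intensity."""
        self.intensity = intensity
        self.recency = 1.0

    def tick(self):
        """Advance one time step; forget the intensity once the trace is over."""
        if self.recency > 0.0:
            self.recency = max(0.0, self.recency - self.decay_step)
            if self.recency == 0.0:
                self.intensity = 0.0


strong_old = EventTrace(decay_step=0.25)
weak_recent = EventTrace(decay_step=0.25)
strong_old.event(1.0)
for _ in range(3):
    strong_old.tick()
weak_recent.event(0.2)
```

At this point strong_old has intensity 1.0 and recency 0.25, whereas weak_recent
has intensity 0.2 and recency 1.0. A single decaying activity could not tell
these two situations apart, which is the reason for keeping two variables.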



Learning by specialization When learning occurs at the level of a cortical 
maxicolumn, columns inside the maxicolumn, that were synchronously firing, 
separate into two asynchronous parts. Each part can split again, further refining 
learning (see (Burnod 1989) for details). As units of the model represent syn- 
chronous columns, the effect of learning at the level of a unit is the creation of 
a new unit, both units representing the new synchronous sets of columns. 



Temporal learning rule The aim of the learning rule presented here is to build
sequences of units, based upon perception. Let us suppose that a given unit i
is called, and that the associated event $e_i$ always occurs after another event $e_j$,
i.e. after the excitation activity of unit j is set. It is then possible to conclude
that getting $e_j$ is a means to get the requested $e_i$, i.e. call activity has to spread
from unit i to unit j. If we consider unit i as the goal “getting occurrence of $e_i$”
when it is called, unit j can be viewed as a subgoal to be called. The learning rule
described in equation 21 allows the detection of the goal/sub-goal relationships
between units, by increasing weights between units. When a subgoal is detected, it
splits, and the split unit is devoted to receiving a call activity from the goal. As it
is called after the call activity at the level of the goal, the subgoal can play the
role of a goal for other units in the map, in order to extend the causal sequence.

Let i be the unit whose subgoals are detected by using the mechanism, and
j one of the other units in the map. The weight $W_{ij}$, initially null, represents the
causal relationship between i and j, and the splitting of j occurs when it reaches
1. A flag $\delta_{ij}$ is associated with each $W_{ij}$. Weights are updated according to
procedure 21.



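\[
\begin{array}{l}
\text{if } E_j^{rec} = 1:\quad \delta_{ij} \leftarrow 1 \\
\text{if } E_i^{call} = 1 \text{ and } E_i^{rec} = 1 \text{ (satisfaction of } i\text{)}: \\
\quad \text{case } E_j^{rec} = M^* \text{ and } M^* > 0:\quad W_{ij} \leftarrow W_{ij} + \tau \cdot E_j^{rec} \cdot E_j^{exc} \\
\quad \text{case } 0 < E_j^{rec} < M^*:\quad W_{ij} \leftarrow W_{ij} + \tau \cdot (E_j^{rec} - M^*) \cdot E_j^{exc} \\
\quad \text{case } E_j^{rec} = 0:\quad W_{ij} \leftarrow W_{ij} - \delta \\
\quad \text{in all cases}:\quad \delta_{ij} \leftarrow 0 \\
\text{else if } E_i^{call} = 1 \text{ and } E_j^{rec} \text{ reaches } 0: \\
\quad W_{ij} \leftarrow W_{ij} - \tau' \cdot E_j^{exc} \cdot \delta_{ij} \\
\text{else nothing to compute.}
\end{array}
\tag{21}
\]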

Parameters $\tau$ and $\tau'$ are fixed learning rates, the symbol $\leftarrow$ stands for variable
setting and $M^* = \max_k E_k^{rec}$. Learning $W_{ij}$ occurs only when unit i is a goal
($E_i^{call} = 1$). When an event $e_j$ occurs ($E_j^{rec} = 1$), the flags $\delta_{kj}$ for all k are raised.
If the goal i is satisfied ($E_i^{call} = 1$ and $E_i^{rec} = 1$), the weight $W_{iJ}$ to the unit J that
has been excited the most recently ($E_J^{rec} = M^*$) increases, proportionally to both
the recency and the stored intensity $E_J^{exc}$ of the event $e_J$. For the other units j
that are “quite recent” ($E_j^{rec} > 0$), the weights $W_{ij}$ are decreased, proportionally
to the relative age $E_j^{rec} - M^*$ of the occurrence of $e_j$, and also according to $E_j^{exc}$,
consistent with the rule for unit J. The mechanism for the most recent unit J
and the other recent units j is a competition for recency, detecting the last predictor
of the goal satisfaction. The weights of events that have not occurred before the
satisfaction of the goal i ($E_j^{rec} = 0$) are decreased by a decay value. Note that
satisfaction of the goal i resets the flags $\delta_{ij}$ of related weights $W_{ij}$. If an event
$e_j$ occurs without being followed by the satisfaction of the goal, i.e. the trace
$E_j^{rec}$ reaches 0, if the goal is still being called ($E_i^{call} = 1$) and if it has not been
satisfied since the occurrence of $e_j$ (the flag $\delta_{ij}$ is still raised), then the weight
$W_{ij}$ is decreased.
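
The following Python sketch shows one possible reading of this update procedure
for a single goal unit i. It is a hedged illustration, not the code of the
model: the function name update_weights and all parameter names are
hypothetical, the dictionaries are indexed by the candidate units j, and we
assume that the intensity an event had just before its trace expired is
supplied by the caller through the expired argument.

```python
def update_weights(W, flags, rec, exc, called, satisfied, expired,
                   tau, tau_prime, decay):
    """One update of the weights W[j] from a goal unit i to the units j.

    W, flags, rec, exc: dicts indexed by unit j (weight, flag, recency, intensity).
    called / satisfied: is the goal called / has it just been satisfied.
    expired: {j: intensity stored just before the trace of j reached 0}.
    """
    for j in W:
        if rec[j] == 1.0:                  # event of unit j has just occurred
            flags[j] = 1
    if called and satisfied:
        m_star = max(rec.values())         # recency of the most recent event
        for j in W:
            if m_star > 0 and rec[j] == m_star:
                W[j] += tau * rec[j] * exc[j]             # last predictor
            elif 0 < rec[j] < m_star:
                W[j] += tau * (rec[j] - m_star) * exc[j]  # older recent events
            else:
                W[j] -= decay                             # no trace left
            flags[j] = 0
    elif called:
        for j, intensity in expired.items():
            # event j was not followed by satisfaction of the goal
            W[j] -= tau_prime * intensity * flags[j]
    return W


W = {"a": 0.0, "b": 0.0, "c": 0.0}
flags = {"a": 0, "b": 0, "c": 0}
rec = {"a": 0.5, "b": 1.0, "c": 0.0}
exc = {"a": 1.0, "b": 0.5, "c": 0.0}
update_weights(W, flags, rec, exc, called=True, satisfied=True, expired={},
               tau=0.5, tau_prime=0.25, decay=0.125)
```

In this example the most recent unit b gains 0.25, the older unit a loses 0.25
and the silent unit c loses the decay value 0.125, which reflects the
competition for recency in favour of the last predictor.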

This mechanism has been shown to be robust to different kinds of temporal 
noise, mainly distortion of sequences, insertion of events, permutation of items 
in the sequence (Frezza-Buet and Alexandre 1999). It is suitable for detecting 
sequences of causality between perceptive events that are intrinsically asynchro- 
nous, which is the case for real perception. Finally, the mechanism allows the 
detection of the last predictor, as opposed to Sutton and Barto rules (see sec- 
tion 2.2) that detect the earliest. This is important since spreading calls from a 
goal to its successive subgoals requires the learning of all intermediate events of 
a perceptive sequence, and not only the first, even if it is actually predictive of 
the last. 



Application This mechanism has been used inside each map of a multi-map
architecture for robot control (Frezza-Buet and Alexandre 1998b). This
architecture contains many maps for detecting multi-modal events that are linked with
the mechanism presented here. The multimodal sequences that are learned are 
the basis for a competition within units in the model. This competition allows 
the robot to trigger the appropriate action at each time, according to the needs 
that initiate call activities, according to the perception that initiates excitation 
activities, and to the knowledge concerning the world stored in the Wij. The 
cortical framework, that drives the design of the map architecture, coupled with 
the local temporal mechanism described in this section, has led here to an
efficient control architecture endowing the robot with the ability to learn elaborate
reactive behavior from its experience in the environment.



4.3 Context Detection for Bistable Transitions 

The previously described model deals with one level sequences (no sub-sequences), 
providing procedural abilities for reactive behavior. This model refers to po- 
sterior cortex functionalities. As we are interested in more complex behaviors, 
involving planning on the basis of connectionist computation, we are currently 
studying prefrontal functions. Some early modeling results will be presented here 
briefly, in order to introduce the use of the context manipulation mechanism de- 
tailed here, allowing temporal scheduling of actions. 



Prefrontal modeling framework for neural planning As mentioned in sec- 
tion 3, prefrontal functionality is grounded on cortical columns having a bistable 
activity pattern. The model described now is an attempt to use this ability for 
planning the behaviour of the robot. Compared with a biological description of 
the cortex, and more precisely of the prefrontal lobe, our functional approach is 
of course very rough, but it has to be seen in the context of the design of efficient 
and highly integrated control architectures. In this modeling framework, prefron- 
tal cortex is a set of units connected to posterior cortex units with one-to-one 
connections (cf. figure 12). The posterior cortex part of the model is similar to 
the one mentioned in section 4.2. It is a module allowing complex servo-control 
oriented computation, i.e. a call activity at the level of a posterior unit (the $P_i$s
in figure 12) triggers an elaborate action of the robot, such as “facing the current
target”. Then, the role of the prefrontal cortex part of the model
(the $F_i$s in figure 12) is to schedule posterior calls, in order to plan the behaviour
of the robot towards finding rewarding situations. 

This scheduling, described in (Burnod 1989) and in a more computer science
oriented way in (Frezza-Buet and Alexandre 1998a), is illustrated by the
following example. Let us suppose that a sequence of events a-b-c has to be
performed for getting a reward. That means that posterior units $P_a$, $P_b$ and $P_c$
have to be excited in that order, after successive calls in these three units. The
role of associated frontal units $F_a$, $F_b$ and $F_c$ (cf. figure 12) is then the following.
First $F_a$ triggers a call on posterior unit $P_a$. As a consequence of this call, let us
suppose that event $e_a$ occurs, meaning that $P_a$ is excited. Then, $F_a$ sends call
activity to $F_b$, which calls $P_b$ to get $e_b$, in the same way. When $e_b$ occurs, $F_b$
transmits the call to $F_c$, which enables the occurrence of the rewarded event $e_c$. The
significance of this sequence, compared with the posterior model of section 4.2,
is the way failures of calls are managed. Let us suppose that the call in $P_a$
triggered by $F_a$ in the previous example is not followed by the event $e_a$ that makes
$P_a$ excited. That means that the current context does not allow the getting of
this event, and that something else has to be done before getting $e_a$ is possible.
What has to be done before is the activation of another sequence of events (they
may involve motor events as a consequence of calls), noted aa-ab in figure 12.
The failure of the call in $F_a$ makes $F_a$ have a sustained specific activity (ON
transition of the bistable) that stacks the purpose of calling $P_a$ without calling it
anymore. Then, a call is transmitted to $F_{aa}$. When sequence aa-ab has been
performed, the perceptive world is supposed to be in a context that allows $e_a$
to occur subsequently to a call in $P_a$. The detection of this context, which is
the temporal mechanism described in this section, triggers the OFF transition of
bistable activity in $F_a$. The effect of this latter transition is first to stop storing
the call that previously failed, and second to retry it, by calling $P_a$ again (the
call is popped out from the stack). Due to the use of aa-ab, this call now allows
$e_a$ to occur, and $F_a$ transmits a call to $F_b$ as in the non-failing case presented
first.
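
The stacking behaviour of this example can be mimicked by a short Python sketch.
It deliberately replaces the learned context detection by a hard-coded table of
preconditions, and it lets the call stack of the language play the role of the
sustained bistable activities. The function run_plan and its arguments are
hypothetical names introduced for illustration only.

```python
def run_plan(sequence, subsequences, preconditions):
    """Schedule posterior calls; stack and retry the calls that fail.

    sequence: events to obtain in order, e.g. ["a", "b", "c"].
    subsequences: {event: events to obtain first when the call fails}.
    preconditions: {event: events that must have occurred for it to occur};
        this table stands in for the state of the world (assumption).
    """
    occurred = []      # events obtained so far, in order
    transitions = []   # log of bistable ON/OFF transitions

    def call_posterior(event):
        if all(p in occurred for p in preconditions.get(event, ())):
            occurred.append(event)    # the posterior unit is excited
            return True
        return False

    def frontal(event):
        if call_posterior(event):
            return True
        if event not in subsequences:
            return False              # nothing known to repair the context
        transitions.append(("ON", event))    # the failed call is stored
        for sub_event in subsequences[event]:
            if not frontal(sub_event):
                return False
        transitions.append(("OFF", event))   # context assumed to be detected
        return call_posterior(event)         # the stored call is retried

    success = all(frontal(event) for event in sequence)
    return occurred, transitions, success


occurred, transitions, success = run_plan(
    sequence=["a", "b", "c"],
    subsequences={"a": ["aa", "ab"]},
    preconditions={"a": ["aa", "ab"]},
)
```

In this run the first call for a fails, the unit for a goes ON, the sub-sequence
aa-ab is performed, the unit goes OFF and the retried call succeeds, so that
occurred is ["aa", "ab", "a", "b", "c"] and transitions is
[("ON", "a"), ("OFF", "a")].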

Context detection, which is crucial for the neuronal stacking allowed by the 
frontal cortex model, has to be robust to the intrinsic asynchronousness of per- 
ceptions. This context detection, which is the learning rule for frontal sequences 
in the model, is presented now and details concerning the use of this mechanism 
for robot control can be found in (Frezza-Buet and Alexandre 1998a) . 



Fig. 12. Prefrontal modeling framework (figure labels: frontal cortex; posterior
cortex).



Asynchronous context detection The purpose of the mechanism is to enable
a bistable frontal unit $F_i$ (cf. figure 12) to learn perceptive context. This context
must ensure that a call on the corresponding $P_i$ will succeed in getting the
associated event $e_i$. At the level of $F_i$, let us consider two events $e_i^+$ and $e_i^-$,
respectively representing the success of the call in $P_i$ and its failure ($F_i$ gives up
calling $P_i$). Learning occurs during both $e_i^+$ and $e_i^-$, as detailed below.

For any frontal connection between Fi and Fj, a couple of weights 
are used. Weights and store correlations between excitation in Pj and 
respective events e)*" and e~ . Using these weights, two kinds of contextual ac- 
tivities Ci and Ci are computed, according to equations 22 where operator 
returns x if x > 0 and 0 otherwise. 









x” 




= l-max{(l- 
= max { Aj X f'p{uj~j - 




(22) 



The value Ej is the trace of excitation activity of Pj (intensity of Cj modu- 
lated by its recency). The value x'l is the fitting of the Ej distribution to the 
stored weights. If no strong correlation has been detected between any Ej 
and ef (w)*)jjax weak), the w^j are not significant, the context xf is then good 
by default (close to 1). The context value xf is sensitive to changes in the Ej 
distribution, allowing the detection that the current distribution is going closer 
to the ideal one (defined by the wfj). As this context is too permissive for reliable 
context detection, another context value xf is defined, which is high when the 
Ej distribution matches exactly the w^j. Finally, if one of the current Ej has 
been detected as being correlated to without being correlated to ef — ujfj
is high) , this Ej is predictive of the failure of the call, and the x~ context value 
is high. Using the contexts xf , xf and x~ , two respectively permissive and not 
permissive contexts Ci and Ci are defined. 

As Ci is reliable, it is used to trigger the OFF transition of Fi. On the other 
hand Cj, being sensitive to the improvement of the distribution of perception, is 
used to sustain bistable activity of Fi until Ci allows the OFF transition. 

The learning rule for the is straightforward, once the previous values 

are computed (cf. equation 23). Note that the uj~j are computed only when the 
call fails but it was supposed to succeed (q > 0). 

For all j, 

When e'^ occurs 
r+ = r 

^ (1 - (23) 

When e~ occurs 

T~ = T X Ci 

^ (1 - 

Interest of the context learning mechanism The context detection is ro- 
bust (Frezza-Buet 1999) to asynchronous perception, due to correlation with 
traces. It also enables one to separate events $e_j$ that often occur whenever
a call in $P_i$ succeeds ($e_i^+$) or not ($e_i^-$) from the events $e_j$ that are responsible
for the failure. This mechanism, coupled with stacking properties of bistable 
activation, is involved in the scheduling of action schemes, synchronizing calls 
towards the posterior cortex model by defining when a sub-sequence has suc- 
ceeded. Learned contexts are also used to determine which sub-sequence to call 
when a call fails. Finally, the same context mechanism has been reused, detecting 
distribution of call activities among the frontal units, to learn which schemes are 
not compatible with the execution of others. 

4.4 Discussion 

The design of mechanisms presented in this section is driven by engineering
constraints (robustness, integration into an efficient architecture), which lead us
to depart from strict adherence to the biological background. Using biological inspiration for such
purposes is constrained by a trade off between biological validity and computa- 
tional efficiency. Nevertheless, cortical modeling offers us a framework to develop 
highly integrated applications. These open architectures can then be refined and 
extended consistently. 

5 Conclusion 

The main goal of this chapter was to present a cortical framework in which iso- 
lated neurobiologically inspired mechanisms can be integrated. It was shown in 
particular that these mechanisms can act at different levels of time, for different
kinds of elementary functions. Data from cortical organization and functioning 
offer a framework for embedding these mechanisms in a way that yields neuro- 
biological plausibility at the neuronal as well as the behavioral level. 

As a conclusion, in a schematic way, we can now consider in turn the me- 
chanisms and their time scale, with the corresponding neuronal and behavioral 
description. 

1. The time scale of one millisecond corresponds, at the neuronal level, to the 
duration of a spike and to the duration of synaptic transmission from one 
neuron to its closest neighbors. It thus corresponds to the minimal time 
scale for a bit of information. At the behavioral level, coincidence detectors, 
in peripheral neural structures, can work at this time scale. 

2. The time scale of ten milliseconds corresponds, at the neuronal level in the 
central structures, to the coding level of spike intervals, since the maximal 
frequency cannot exceed 100 Hz in these structures. This thus corresponds 
to the minimal timing from one area to the next. At the behavioral level, it 
corresponds to the focus of attention or to the timing of feedback information 
flow from a higher level to a lower level map.

3. The time scale of one hundred milliseconds corresponds, at the neuronal 
level, to the activation dynamics of a population of neurons, which are locally 
synchronized at this time scale. It is the basic timing of activities in the cortex 
which are linked with simple sensory and motor events. At the behavioral 
level, it corresponds to the minimal reaction times from the first processing 
layer (stimulus) to the last (recognition or action). It is thus the minimal 
time for the simplest sensorimotor loops. 

4. The time scale of one second corresponds, at the neuronal level, to the time 
scale of the basic processes which can result in learning. Neurons in the 
higher levels of associative cortex can stay active on this time scale, even 
if the stimulus is no longer present. At the behavioral level, it corresponds 
to the time scale of correspondences between sensory and motor events on 
different modalities which can produce reinforcement. It also corresponds 
to a level of information processing which has strong intrinsic regulations 
within modalities (for example, the exploration of an object). 

5. The time scale of ten seconds corresponds, at the neuronal level, to the 
typical time scale of working memory in the frontal regions of the cortex. The 
learning mechanism is performed by the control of bistable states of frontal 
neurons in order to build stacks. At the behavioral level, this time scale 
corresponds to the processes in the frontal cortex allowing the organization 
of temporal aspects of behavior like the exploration of a scene. 

6. The time scale of one hundred seconds and more corresponds, at the neuronal 
level, to the very long time constants of some neuronal intrinsic metabolic 
and genetic processes. Rhythms with long periods can be produced by such 
structures as the reticular formation and the hypothalamic nuclei. Such in- 
ternal clocks can influence cortical activity by modulators which can switch 
the intrinsic temporal programs of large populations of cortical neurons. At 
the behavioral level, these biological rhythms can have a large contextual
influence (like emotion) and can produce global regulation of the behavioral
programs within the whole network. 



On the one hand, each basic temporal mechanism that we have presented 
above can be related to one (or two consecutive) of these temporal resolutions 
and to the corresponding neuronal and behavioral mechanisms. On the other 
hand it is clear that a fully plausible behavioral model should include all of the 
six levels of time. In any case, it should not be restricted to one or two levels. We 
believe that the cortical framework that we have presented here allows one to 
work at the same time at these different temporal levels, with the corresponding 
behavioral abilities. 



1. Introduction

Much sensory-motor behavior develops through imitation, as during the learning of 
handwriting by children (Burns, 1962; Freeman, 1914; Iacoboni et al., 1999). Such
complex sequential acts are broken down into distinct motor control synergies, or 
muscle groups, whose activities overlap in time to generate continuous, curved move- 
ments that obey an inverse relation between curvature and speed. How are such com- 
plex movements learned through attentive imitation? Novel movements may be made 
as a series of distinct segments that may be quite irregular both in space and time, but 
a practiced movement can be made smoothly, with a continuous, often bell-shaped, 
velocity profile. How does learning of sequential movements transform reactive imi- 
tation into predictive, automatic performance? 

A neural model is summarized here which suggests how parietal, frontal, and mo- 
tor cortical mechanisms, such as difference vector encoding, interact with adaptively- 
timed, predictive cerebellar learning during movement imitation and predictive per- 
formance (Grossberg & Paine, 2000). To initiate movement, visual attention shifts 
along the shape to be imitated and generates vector movement using motor cortical 
cells. During such an imitative movement, cerebellar Purkinje cells with a spectrum of 
delayed response profiles sample and learn the changing directional information and, 
in turn, send that learned information back to the cortex and eventually to the muscle
synergies involved. If the imitative movement deviates from an attentional focus
around a shape to be imitated, the visual system shifts attention, and may make an eye 
movement back to the shape, thereby providing corrective directional information to 
the arm movement system. 

This imitative movement cycle repeats until the corticocerebellar system can accu- 
rately drive the movement based on memory alone. A cortical working memory buffer 
transiently stores the cerebellar output and releases it at a variable rate, allowing speed 
scaling of learned movements which is limited by the rate of cerebellar memory read- 
out. Movements can be learned at variable speeds if the density of the spectrum of 
delayed cellular responses in the cerebellum varies with speed. Learning at slower 
speeds facilitates learning at faster speeds. Size can be varied after learning while 
keeping the movement duration constant (isochrony). Context-effects arise from the 
overlap of cerebellar memory outputs. The model is used to simulate key psycho- 
physical and neural data about learning to make curved movements, including a de- 
crease in writing time as learning progresses; generation of unimodal, bell-shaped ve- 
locity profiles for each movement synergy; size and speed scaling with preservation of 
the letter shape and the shapes of the velocity profiles; an inverse relation between 
curvature and tangential velocity; and a Two-Thirds Power Law relation between an- 
gular velocity and curvature. 



2. Model Precursors

The new model, called Adaptive VITEWRITE (AVITEWRITE), builds on two previ- 
ous movement models. The first is the Vector Integration to Endpoint (VITE) model 
(Bullock & Grossberg, 1988a, 1988b, 1991) (Figure 1). The VITE model successfully
explained psychophysical and neurobiological data about how synchronous multi-joint
reaching trajectories could be generated at variable speeds. VITE was later expanded 
(Bullock, Cisek, & Grossberg, 1998) to explain how arm movements are influenced by 
proprioceptive feedback and external forces, among other related factors. The firing 
patterns of six distinct cell types in cortical areas 4 and 5 were also simulated during 
various movement tasks (Kalaska et al., 1990). In order to allow a greater focus on 
issues related to the learning of curved movements, the AVITEWRITE model avoids 
explicit descriptions of muscle dynamics, and therefore uses components of the earlier 
VITE models of Bullock and Grossberg (1988a, 1988b, 1991). 

A second basis for the AVITEWRITE model is the VITEWRITE model of Bullock,
Grossberg, and Mannes (1993) (Figure 2). The curved trajectories of handwriting
require more than simple point-to-point movements. Curved handwriting trajectories
appear to be generated by sequences of movement synergies (Bernstein, 1967;
Kelso, 1982), or groups of muscles working together to drive the limb in prescribed 
directions, whose activities overlap in time (Morasso et al., 1983; Soechting & Ter- 
zuolo, 1987; Stelmach et al., 1984). VITEWRITE uses such a synergy-overlap strat- 
egy to generate curved movements from individual, target-driven strokes. A key issue 
faced by all models which seek to generate curves by overlapping strokes is how to 
appropriately time the strokes to generate a particular curve. VITEWRITE avoids an
explicit representation of time in the control of synergy activation by using features of
the movement itself; namely, times of zero or of maximum velocity, to trigger activa- 
tion of a subsequent synergy. However, movement in VITEWRITE is controlled by a 
predefined sequence of "planning vectors" which cause unimodal velocity profiles for 
the synergies that control each directional component of a curve. VITEWRITE does 
not address how these planning vectors may be discovered, learned, and stored in a 
self-organizing process which can generate unimodal velocity profiles for each direc- 
tional component of a curved movement. This challenge is met by the Adaptive 
VITEWRITE model. 




Fig. 1. (a) A match interface within the VITE model continuously computes a difference
vector (DV) between the target position vector (TPV) and a present position vector
(PPV), and adds the difference vector to the present position vector. (b) A GO signal
gates execution of a primed movement vector and regulates the rate at which the
movement vector updates the present position command. (Adapted with permission 
from Bullock and Grossberg, 1988a.) 



AVITEWRITE describes how the complex sequences of movements involved in 
handwriting can be learned through the imitation of previously drawn curves. Al- 
though the system described herein could be modified to learn from the actual move- 
ments of a teacher, the present model learns by imitating the product of that teacher’s 
movements, the static image of a written letter. AVITEWRITE shows how initially 
segmented movements with multimodal velocity profiles during the early stages of
learning can become the smooth, continuous movements with the unimodal, bell-shaped
velocity profiles observed in adult humans (Abend et al., 1982; Edelman &
Flash, 1987; Morasso, 1981; Morasso et al., 1983) after multiple learning trials. Early,
error-prone handwriting movements with many visually reactive, correctional compo- 
nents gradually improve over time and many learning trials, to become automatic, 
error-free movements which can even be performed without visual feedback. A key 
factor in this transition is the use of synergy-based learning (see below), which retains
some degree of stability across otherwise highly variable practice trials. 




Fig. 2. The VITEWRITE model of Bullock et al. (1993). A Vector Plan functions as a
motor program that stores discrete planning vectors DV_p in a working memory. A
GRO signal determines the size of script and a GO signal its speed of execution. After
the vector plan and these will-to-act signals are activated, the circuit generates script
automatically. Size-scaled planning vectors DV_p · GRO are read into a target position
vector (TPV). An outflow representation of present position, the present position vector
(PPV), is subtracted from the TPV to define a movement difference vector (DV_m).
The DV_m is multiplied by the GO signal. The net signal DV_m · GO is integrated by the
PPV until it equals the TPV. The signal DV_m · GO is thus an outflow representation of
movement speed. Maxima or zero values of its cell activations may automatically
trigger read-out of the next planning vector DV_p. (Reproduced with permission from
Bullock et al., 1993.)

The AVITEWRITE model architecture is briefly outlined below (Figure 3) and described
later in detail in the Model Description (Figure 9). At the start of movement,
visual attention (1) focuses on the current hand position and moves to select a target 
position (2) on the curve being traced. A Difference Vector representation (3) of the 
distance and direction to the target is formed between the current hand position (PPV) 
and the new target position (TPV). This Difference Vector activates the appropriate 
muscle synergy (4) to drive a reactive movement to that target. At the same time, a 
cerebellar adaptive timing system (5) (Fiala et al., 1996) learns the activation pattern
of the muscle synergy involved in the movement and begins to cooperate or compete (6)
with reactive visual attention for control of the motor cortical trajectory generator
(7). A working memory (8) transiently stores learned motor commands to allow them
to be executed at decreased speeds as the speed and size of trajectory generation are 
volitionally controlled through the basal ganglia (9). Reactive visual control takes over 
when memory causes mistakes. Both the movement trajectory and the memory are 
then corrected, allowing memory to take over control again. As successive, visually 
reactive movements are made to a series of attentionally chosen targets on the curve, a 
memory is formed of the muscle synergy activations needed to draw that curve. After 
tracing the curve multiple times, memory alone can yield error-free movements. 







Fig. 3. Conceptual diagram of the AVITEWRITE architecture. Numbers in parentheses
indicate the order of discussion in the text.

Several properties of human handwriting movements emerge when AVITEWRITE
learns to write a letter. Size and speed can be volitionally varied (Figure 3, (9)) after 
learning while preserving letter shape and the shapes of the velocity profiles (Plamon- 
don & Alimi, 1997; Schillings et al., 1996; van Galen & Weber, 1998; Wann & 
Nimmo-Smith, 1990; Wright, 1993). Isochrony, the tendency for humans to write let- 
ters of different sizes in the same amount of time, is also demonstrated (Thomassen & 
Teulings, 1985; Wright, 1993). Speed can be varied during learning, and learning at 
slower speeds facilitates future learning at faster speeds (Alston & Taylor, 1987, p. 
115; Burns, 1962, pp. 45-46; Freeman, 1914, pp. 83-84). Unimodal, bell-shaped velocity
profiles for each movement synergy emerge as a letter is learned, and they
closely resemble the velocity profiles of adult humans writing those letters (Abend et 
al., 1982; Edelman & Flash, 1987; Morasso, 1981; Morasso et al., 1983). An inverse 
relation between curvature and tangential velocity is observed in the model’s perform- 
ance (Lacquaniti et al., 1983). It also yields a Two-Thirds Power Law relation be- 
tween angular velocity and curvature, as seen in human writing under certain condi- 
tions (Lacquaniti et al., 1983; Thomassen & Teulings, 1985; Wann et al., 1988). Fi- 
nally, context effects become apparent when AVITEWRITE generates multiple con- 
nected letters, reminiscent of carryover coarticulation in speech (Hertrich & Acker- 
mann, 1995; Ostry et al., 1996), and similar to handwriting context effects reported by 
Greer and Green (1983), and Thomassen and Schomaker (1986). 
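
The Two-Thirds Power Law states that angular velocity A and curvature C are related by A = k C^(2/3), which is equivalent to tangential velocity falling as the inverse cube root of curvature. As a minimal illustration, and not a simulation of AVITEWRITE itself, the sketch below assumes an ellipse traced with harmonic x and y components (x = a cos t, y = b sin t), a textbook case in which the law holds exactly. The function names are hypothetical.

```python
import math

def ellipse_kinematics(a, b, t):
    """Speed, curvature and angular velocity on x = a cos t, y = b sin t."""
    dx, dy = -a * math.sin(t), b * math.cos(t)        # velocity components
    ddx, ddy = -a * math.cos(t), -b * math.sin(t)     # acceleration components
    speed = math.hypot(dx, dy)                        # tangential velocity
    curvature = abs(dx * ddy - dy * ddx) / speed ** 3
    angular_velocity = speed * curvature
    return speed, curvature, angular_velocity

def power_law_gain(a, b, t):
    """Return A / C**(2/3); this is constant along the path if the law holds."""
    _, curvature, angular_velocity = ellipse_kinematics(a, b, t)
    return angular_velocity / curvature ** (2.0 / 3.0)
```

For this assumed movement the gain equals (ab)^(1/3) everywhere on the path, and the speed is lowest where the curvature is highest, which is the inverse relation between curvature and tangential velocity described above.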



3. Movement Synergies 

Movement synergies are groups of muscles that work together in a common task. The 
brain seems to control complex movement tasks, such as walking or handwriting, by 
issuing commands to a few muscle synergies, as opposed to specifying the movement 
parameters for scores of individual muscles separately (Bizzi et al., 1998; Buchanan et 
al., 1986; Kelso, 1982; Turvey, 1990). Using muscle synergies greatly simplifies the 
control and planning of movement by lessening the number of degrees of freedom 
requiring executive control (Bernstein, 1967; Turvey, 1990). Only at lower levels of 
the central nervous system, such as in the brainstem and spinal cord, would the motor 
synergy commands branch out to individual muscles. A key question is how these 
movement synergies are controlled. 

Human movements can be broken down into individual movement segments, or 
strokes. Each stroke corresponds to the activities of particular muscle synergies. When 
the muscle synergies controlling a limb are activated synchronously (Figure 4b), there 
is a tendency to make simple, straight movements (Hollerbach & Flash, 1982; Mo- 
rasso, 1986) with bell-shaped velocity profiles (Abend et al., 1982; Morasso, 1981; 
Morasso et al., 1983), (Figure 4b). Curved movements may be generated by a linear 
superposition of straight strokes due to asynchronous synergies (Figure 4a and 4c) 
(Morasso et al., 1983; Soechting & Terzuolo, 1987; Stelmach et al., 1984). Thus, a 
key issue is how the timing of strokes is determined. In curved movements, each syn- 
ergy generates its own bell-shaped velocity profile. A simple example is a "U" curve 
(Figure 5), drawn as a combination of three strokes: one for a synergy in the negative,
vertical direction; a second in the positive, horizontal direction; and a final stroke in
the positive, vertical direction (Figures 5b and 5c). The observation that the curved 
movements of handwriting obey an inverse relation between curvature and velocity 
(Lacquaniti et al., 1983) can be attributed to the direction reversal and synergy
switching which occurs at points of high curvature, as at the bottom of a "U" curve 
(Figure 5d and 5e). 
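
The effect of relative synergy timing can be sketched numerically. The example below is an illustration under stated assumptions rather than the model's own equations: each synergy is given a raised-cosine velocity profile (one convenient bell shape), the x and y synergies are simply integrated and superimposed, and all function names are hypothetical.

```python
import math

def bell_velocity(t, onset, duration, amplitude):
    """Unimodal, bell-shaped velocity profile whose integral is `amplitude`."""
    if t < onset or t > onset + duration:
        return 0.0
    phase = (t - onset) / duration
    return amplitude / duration * (1.0 - math.cos(2.0 * math.pi * phase))

def trace(x_onset, y_onset, duration=1.0, dt=0.001, total=2.0):
    """Integrate the x and y synergy velocities into a hand path [(x, y), ...]."""
    x = y = 0.0
    path = [(x, y)]
    for i in range(int(round(total / dt))):
        t = i * dt
        x += bell_velocity(t, x_onset, duration, 1.0) * dt
        y += bell_velocity(t, y_onset, duration, 1.0) * dt
        path.append((x, y))
    return path

def max_deviation_from_chord(path):
    """Largest distance of the path from the straight line joining its ends."""
    (x0, y0), (x1, y1) = path[0], path[-1]
    length = math.hypot(x1 - x0, y1 - y0)
    return max(abs((x1 - x0) * (py - y0) - (y1 - y0) * (px - x0)) / length
               for px, py in path)
```

Calling `trace(0.0, 0.0)` activates both synergies synchronously and gives a straight path, whereas `trace(0.0, 0.5)` delays the y synergy and bends the path, even though both movements end at the same point.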




Fig. 4. Varying the relative timing of synergy activation can yield different curved 
movements. Synchronous synergy activation yields straight movements (b) while 
asynchronous synergy activation can yield curved movements in (a) and (c). The
dotted and solid curves represent synergies that control movements in the positive y
and x directions, respectively.



4. The VITE Model of Reaching

How is movement direction represented in the brain? Much research (e.g., Andersen et 
al., 1995; Georgopoulos et al., 1982, 1989, 1993; Mussa-Ivaldi, 1988) suggests that 
motor and parietal cortex compute a vectorial representation of movement direction in 
motor and/or spatial coordinates. The VITE model (Bullock & Grossberg, 1988a, 
1988b, 1991) showed how a vectorial representation of movement direction and length 
could generate straight reaching movements with bell-shaped velocity profiles (Figure 
1). Such a Difference Vector (DV) is computed as the difference from an outflow representation
of the current hand position, or Present Position Vector (PPV), to a target,
or Target Position Vector (TPV) (Figure 6). The DV is multiplied by a gradually in- 
creasing GO signal, that is under volitional control, whose growth rate can be changed 
to alter movement speed while preserving movement direction and length (Figure 1).
The "GO" signal seems to be generated within the basal ganglia (Hallett & Khoshbin,
1980; Georgopoulos et al., 1983; Horak & Anderson, 1984a, 1984b; Berardelli et al., 
1996; Turner & Anderson, 1997; Turner et al., 1998). The DV times the GO signal is 
an outflow representation of movement velocity that is integrated at the PPV until the 
present position of the hand reaches the target. 

The VITE model explains behavioral and neural data about how a motor synergy 
can be commanded to generate a synchronous, multi-joint reaching trajectory at variable
speeds. VITE describes how synchronous movements may be generated across
synergistic muscles with automatic compensation for the different total contractions
undergone by each muscle group. Many properties of human reaching movements 
emerge from VITE’s performance, including the equifinality of movement synergies, a 
rate-dependence of velocity profile asymmetries, and variations in the ratio of maxi- 
mum to average movement velocities (Bullock & Grossberg, 1988a, 1988b, 1991). 




Fig. 5. (a) A "U" curve written by a human; (b) and (c): x and y direction velocity profiles,
respectively; (d) movement curvature; (e) tangential velocity. (Reproduced with
permission from Edelman and Flash, 1987.) 




Fig. 6. (a) Illustration of a Difference Vector (DV) formed from the current hand lo- 
cation, given by a Present Position Vector (PPV), to a Target Position Vector (TPV). 
The DV is integrated in a VITE circuit to generate a straight movement with a bell- 
shaped velocity profile (b). 

The expanded VITE model of Bullock, Cisek, and Grossberg (1998) assigned 
functional roles to six cell types in movement-related, primate cortical areas 4 and 5, 
and integrated them into a system which is capable of "continuous trajectory forma- 
tion; priming, gating, and scaling of movement commands; static and inertial load 
compensation; and proprioception" (Bullock et al., 1998, p. 48). In this more detailed 
model, Difference Vector cells resemble the activity of posterior parietal area 5 phasic
cells, while Present Position Vector cells behave like anterior area 5 tonic cells.

The VITEWRITE model of Bullock, Grossberg, and Mannes (1993) (Figure 2) extended
the VITE reaching model to explain handwriting data. In VITEWRITE, curved
movements are generated using a velocity-dependent stroke-launching rule that allows
asynchronous superposition of sequential muscle synergy activations with unimodal,
bell-shaped velocity profiles for each synergy. Scaling the size of DVs by multiplication
with a volitional GRO signal allows size scaling without significantly altering the
trajectory shape or the shape of the velocity profile. Similarly, altering the size of the 
volitional GO signal alters trajectory speed without changing trajectory shape. The 
movements generated by VITEWRITE yield the inverse relation between curvature 
and tangential velocity observed in human performance, as well as the Two-Thirds 
Power Law relation between angular velocity and curvature observed in humans under
some writing conditions (Lacquaniti et al., 1983; Thomassen & Teulings, 1985; Wann 
et al., 1988). VITEWRITE also shows how size scaling of individual synergies via 
separate GRO signals can change the style of writing without altering velocity profile 
shape. Such independent scaling of muscle synergy commands is supported by the 
study of Wann and Nimmo-Smith (1990), which yielded data that "do not support 
common scaling for x and y dimensions" (p. 111).

Fig. 7. Cerebellar spectral timing circuit: Long Term Depression (LTD) occurs at
the parallel fiber-Purkinje cell synapse when an unconditioned stimulus (US) is paired
with a conditioned stimulus (CS) over multiple presentations. (Adapted with permission
from Grossberg and Merrill, 1996.) See text for details.
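
The size scaling attributed to the GRO signal in VITEWRITE can be illustrated with a small sketch. It rests on several simplifying assumptions that are not part of the published model: planning vectors are launched at fixed, predefined times instead of by the velocity-dependent launching rule, the GO signal is a linear ramp, and the difference vector tracks TPV minus PPV instantaneously. All names are hypothetical.

```python
import math

def write_strokes(planning_vectors, launch_times, gro=1.0, go_rate=20.0,
                  dt=0.001, duration=3.0):
    """Return (path, speeds) for sequentially launched 2-D planning vectors."""
    tpv = [0.0, 0.0]                          # Target Position Vector
    ppv = [0.0, 0.0]                          # Present Position Vector
    pending = sorted(zip(launch_times, planning_vectors))
    path, speeds = [tuple(ppv)], []
    for i in range(int(round(duration / dt))):
        t = i * dt
        while pending and pending[0][0] <= t:
            _, plan = pending.pop(0)
            # read the size-scaled planning vector into the TPV
            tpv = [tp + gro * p for tp, p in zip(tpv, plan)]
        go = go_rate * t                      # assumed: linearly growing GO signal
        velocity = [go * (tp - pp) for tp, pp in zip(tpv, ppv)]
        ppv = [pp + v * dt for pp, v in zip(ppv, velocity)]
        path.append(tuple(ppv))
        speeds.append(math.sqrt(velocity[0] ** 2 + velocity[1] ** 2))
    return path, speeds
```

Because this simplified system is linear in the planning vectors, doubling `gro` doubles every point of the path and every velocity sample while leaving the timing untouched, so the shape and the movement duration are preserved. With fixed launch times, changing `go_rate` would alter the shape, which is one reason the velocity-dependent launching rule matters in VITEWRITE.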



The Adaptive VITEWRITE model yields performance which is equally consistent 
with available handwriting data. In addition, AVITEWRITE addresses the main limi- 
tation of VITEWRITE, which is its inability to learn and remember the motor plan 
that, once learned, yields such good performance. The original VITEWRITE model 
does not address "the self-organizing process that discovers, learns, and stores repre- 
sentations of movement commands" (Bullock et al., 1993, p. 22). The pattern of 
"planning vectors" which formed VITEWRITE ’s motor program, or plan, needed to be 
predefined in order for the system to generate a movement or write a particular letter.
In contrast, AVITEWRITE learns how to generate letters by itself, and then remembers
how to write them. The cost in so doing is a considerably larger learned memory.
It remains to be seen whether and how the very parsimonious synergy-launching rule 
that was used in VITEWRITE can be assimilated into this learning scheme. 



6. Adaptive Timing in the Cerebellum 



Given that curved movements may be generated by asynchronous activation of multi- 
ple muscle synergies, we need to understand how the time-varying activation of asyn- 
chronous muscle synergies, or strokes, is learned. Several mechanisms have been pro- 
posed to learn how to adaptively time responses to stimuli. Possible timing 
mechanisms include delay lines (Moore et al., 1989; Zipser, 1986), a spectrum of
slow responses with different reaction rates in a population of neurons (Bartha et al., 
1991; Bullock et al., 1994; Grossberg & Merrill, 1992, 1996; Grossberg & Schmajuk, 
1989; Jaffe, 1992), and temporal evolution of the network activity pattern (Buono- 
mano & Mauk, 1994; Chapeau-Blondeau & Chauvet, 1991). Given the need to learn 
time delays of up to four seconds in eye blink conditioning, delay lines of sufficient 
length do not appear to be present in the cerebellar cortex (Fiala et al., 1996; Freeman, 
1969). Network noise over a four second interval seems to preclude temporal network 
evolution mechanisms (Buonomano & Mauk, 1994; Fiala et al., 1996). 

Accumulating evidence suggests that adaptively timed learning of strokes may be 
achieved by spectral timing in the cerebellum. Fiala et al. (1996) and others (Ito, 1984; 
Perrett et al., 1993) have suggested that the cerebellum may be involved in the open- 
ing of a timed gate to express a learned motor gain, as when a rabbit learns to blink 
after hearing a tone previously associated with an air puff. In this conception (Figure
7), a signal associated with a Conditioned Stimulus (CS) arrives via the cerebellar 
(mossy fiber)-to-(parallel fiber) pathway at a population of Purkinje cells and triggers 
a series of phase-delayed activation profiles, or depolarizations, of the Purkinje cells, 
called a Purkinje cell "spectrum" (Figure 8b). When a signal associated with a subse- 
quent Unconditioned Stimulus (US) arrives via climbing fibers at some fixed Inter- 
stimulus Interval (ISI) after the CS, then Long Term Depression (LTD) of active 
Purkinje cells may occur at that time (Figure 8a), leading to disinhibition of the cere- 
bellar nuclei at that time (Figure 7); hence the term "adaptive timing" (Fiala et al., 
1996; Grossberg & Merrill, 1992, 1996; Grossberg & Schmajuk, 1989). The staggered 
temporal pattern of Purkinje cell depolarizations following the initial CS ensures that 
some Purkinje cells will be active, and subject to Long Term Depression, at the time 
that the US arrives via the climbing fibers (Figure 8a).

Fig. 8. (a) Purkinje cell spectrum (bottom) and adaptively timed Long Term Depression
(LTD) over multiple CS-US pairings. As the unconditioned stimulus (US) arrives
over multiple learning trials at a fixed interstimulus interval after the conditioned
stimulus (CS), LTD occurs at those Purkinje cells which are active when the US arrives
(shaded response curves). (b) Purkinje cell depolarization spectrum from Fiala et
al. (1996) equations. Continuous glutamate input = 10 microM. (Adapted with permission
from Fiala et al., 1996.)
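
The logic of spectral timing can be sketched without the biochemical detail of Fiala et al. (1996). The example below is an abstraction under explicit assumptions: each Purkinje cell's phase-delayed depolarization is replaced by a Gaussian bump of fixed width, LTD is modeled as a multiplicative decrease of a synaptic weight in proportion to the cell's activity when the US arrives, and nuclear disinhibition is taken to be the drive removed by the depressed cells. All names and parameter values are hypothetical.

```python
import math

def spectrum_activity(t, peak_time, width=0.15):
    """Assumed activation profile of one spectral cell, peaking at `peak_time`."""
    return math.exp(-0.5 * ((t - peak_time) / width) ** 2)

def train_ltd(peak_times, isi, trials=20, rate=0.2):
    """Depress each cell in proportion to its activity at the time of the US."""
    weights = [1.0] * len(peak_times)         # parallel fiber-Purkinje strengths
    for _ in range(trials):                   # one CS-US pairing per trial
        for k, peak in enumerate(peak_times):
            active = spectrum_activity(isi, peak)
            weights[k] -= rate * active * weights[k]          # LTD step
    return weights

def nuclear_output(t, peak_times, weights):
    """Disinhibition of the nucleus: drive removed by the depressed cells."""
    return sum((1.0 - w) * spectrum_activity(t, p)
               for w, p in zip(weights, peak_times))

def peak_response_time(peak_times, weights, dt=0.01, duration=2.5):
    """Time after CS onset at which the learned output is largest."""
    times = [k * dt for k in range(int(round(duration / dt)) + 1)]
    return max(times, key=lambda t: nuclear_output(t, peak_times, weights))
```

With a spectrum of cells peaking every 0.1 s, training at an interstimulus interval of 1.0 s depresses most strongly the cell that peaks at 1.0 s, and the learned output is then largest close to 1.0 s after the CS. Training at a different interval shifts the timed response accordingly.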

Fiala et al. (1996) utilized biochemical mechanisms of the metabotropic glutamate 
receptor (mGluR) system to simulate how learning of adaptively timed Long Term 
Depression, or LTD, of Purkinje cells occurs and causes disinhibition of cerebellar 
nuclei during classical conditioning. Fiala et al. (1996) showed that a Purkinje cell 
spectrum could learn to respond to two conditioned stimuli with different interstimulus
intervals (p. 3770). AVITEWRITE takes this approach one step further. Instead of
learning one or two responses at discrete points in time, as in the conditioning task, it 
is hypothesized that the cerebellar adaptive timing mechanism can also learn a con- 
tinuous response over time in sequential tasks like handwriting. 

For a continuous handwriting task, different Purkinje cell spectra are activated by 
the commands corresponding to different muscle synergies. The climbing fiber uncon- 
ditioned stimuli act as error-based signals that train the Purkinje cells to become hy- 
perpolarized in specific temporal patterns that lead to correctly shaped writing move- 
ments. The level of depression of a given Purkinje cell determines the extent of cere- 
bellar nucleus disinhibition during that Purkinje cell’s activation and thus the learned 
gains for controlling a particular muscle synergy during a brief time window of 
movement. When these brief, individual movement commands are summed over the 
entire Purkinje cell population with staggered, overlapping cell activations, a continu- 
ously changing pattern of muscle synergy activations may be generated which can 
yield curved planned movements. In the AVITEWRITE model, this cerebellar adap- 
tive timing module forms part of an integrated sequential learning and generation sys- 
tem (Figures 3 and 9) that also uses elements of VITE cortical and basal ganglia tra- 
jectory formation for visually reactive movements to targets and synergy-based spec- 
tral activation, as well as ideas from VITEWRITE about how working memories can 
build curved movements from overlapping synergies in a way that preserves shape- 
invariant volitional speed and size scaling. 
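
The step from discrete timed responses to a continuous learned response can be illustrated with an error-driven sketch. It is not the AVITEWRITE learning law: it assumes Gaussian activation profiles for the staggered spectrum, a single synergy, a raised-cosine target velocity profile, and a plain delta rule in which the teaching signal is the difference between the desired and the currently generated command. All names are hypothetical.

```python
import math

def basis(t, center, width=0.1):
    """Assumed activation of one spectral cell with its peak at `center`."""
    return math.exp(-0.5 * ((t - center) / width) ** 2)

def readout(t, centers, gains):
    """Summed output of the staggered, overlapping cell population."""
    return sum(g * basis(t, c) for g, c in zip(gains, centers))

def bell_target(t):
    """Assumed desired command: a bell-shaped velocity profile on [0, 1]."""
    return 0.5 * (1.0 - math.cos(2.0 * math.pi * t))

def learn_profile(target, centers, trials=200, rate=0.05, dt=0.02, duration=1.0):
    """Delta-rule learning of the gains; returns (gains, RMS error per trial)."""
    gains = [0.0] * len(centers)
    steps = int(round(duration / dt))
    errors = []
    for _ in range(trials):
        total = 0.0
        for i in range(steps + 1):
            t = i * dt
            error = target(t) - readout(t, centers, gains)    # teaching signal
            for k, c in enumerate(centers):
                gains[k] += rate * error * basis(t, c)        # only active cells change
            total += error * error
        errors.append(math.sqrt(total / (steps + 1)))
    return gains, errors
```

Each cell learns a gain only for the brief window in which it is active, yet the summed read-out approximates the whole time-varying command, and the error that drives learning shrinks over trials.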

This view of a cerebellar role in handwriting is consistent with data showing that 
there is cerebellar activity during drawing, and that the cerebellum is more active 
when lines are retraced than in new line generation because error detection (deviation 
from the lines) occurs during retracing but not new line generation (Jueptner & Weil- 
ler, 1998). Since the cerebellum is more active during error corrections, it is likely that 
climbing fibers (Figure 7) are signaling movement error, leading to Long-Term Depression
or LTD of Purkinje cell-parallel fiber synapses (Gellman et al., 1985; Ito,
1991; Ito & Karachot, 1992; Oscarsson, 1969; Simpson et al., 1996). In a similar vein, 
the cerebellum may also be involved in a variety of complex sequential tasks. It is 
known that there is a cerebellar role in procedural memory. In a sequential button 
press task, lesions to the dentate nucleus cause deficits in learning and memory (Lu et 
al., 1998). Further, Doyon et al. (1998) demonstrated through studies using a sequen- 
tial finger movement task that the cerebellum and striatum are involved in the auto- 
matization and long-term retention of motor sequence behavior. 

In addition to showing how the cerebellum may be involved in learning a 
sequential handwriting task, the AVITEWRITE model also shows how the cerebellum 
may encode movement velocity. It is known that Purkinje cell simple spike discharge 
is direction- and speed-dependent (Coltz et al., 1999a; Ebner, 1998). Simple spikes result
from summation of excitatory postsynaptic potentials at parallel fiber-Purkinje
cell synapses, across multiple Purkinje cell dendrites (Ghez, 1991, p. 631). AVITE- 
WRITE assumes that movement context information, such as the movement direction 
and speed, is carried via the parallel fibers to the Purkinje cell populations controlling 
particular muscle synergies. Further, complex spike discharge of Purkinje cells is
"spatially tuned and strongly related to movement kinematics" (Fu et al., 1997). A
complex spike results when a single action potential is carried to a Purkinje cell via a
climbing fiber, triggering a large Purkinje cell action potential followed by a high-fre- 
quency burst of smaller action potentials (Ghez, 1991, p. 631). In AVITEWRITE, the 
climbing fiber inputs act as error-correcting signals which train Purkinje cells that 
control particular muscle synergies to become hyperpolarized at the appropriate times 
during movement. AVITEWRITE therefore assumes that the climbing fiber signal is 
dependent on the direction and amplitude of a required corrective movement. The re- 
quired corrective movement is different from, and possibly in the opposite direction 
to, the actual movement of that particular muscle synergy, which is reflected in simple 
spike activity. In fact, Coltz et al. (1999b) have found that complex spike discharge is 
direction- and speed-dependent, and that it is related to directions opposite those of the 
corresponding simple spikes, and to speeds different from those of the simple spikes. 
This appears to be further evidence that climbing fibers transmit a movement error 
signal. The model suggests how, using a spectrum of phase-delayed Purkinje cell acti- 
vations based on adaptive timing mechanisms, learned cerebellar outputs may code 
movement gain and velocity. 



7. AVITEWRITE Model 

AVITEWRITE uses visual spatial attention to determine where the hand will move 
to imitate a curve. Attention is modeled algorithmically since it is not the main focus 
of the present study. The model assumes that attention may be focused within a circu- 
lar region around the present fixation point. Attention is initially focused around the 
current hand position on a template curve (Figure 9). Attention then shifts along the 
curve to another target (TPV: Target Position Vector) on the shape that lies within an
attentional radius of the current hand position (PPV: Present Position Vector). How 
this is modeled will be more explicitly stated below. 

In support of the model’s use of spatial attention, experimental data suggest that 
superior frontal, inferior parietal, and superior temporal cortex are part of a network 
for voluntary attentional control (Hopfinger et al., 2000) which is critical for directing 
"unpracticed movements in man" (Richer et al., 1999, p. 1427). Jueptner et al. (1997a, 
1997b) reported that the prefrontal cortex was activated in a finger movement- 
sequence learning task during new learning but not during automatic performance af- 
ter learning. Further, the left dorsal prefrontal cortex was reactivated "when subjects 
paid attention to the performance of the pre-learned sequence" (Jueptner et al., 1997b,
p. 1313). Evidence for an interaction between parietal and frontal lobe activity and 
cerebellar activity was found by Arroyo-Anllo and Botez-Marquard (1998). The 
authors found that humans with olivopontocerebellar atrophy suffered deficits in 
copying a simple figure and in immediate visual spatial memory, "consistent with the 
hypothesis that the cerebellum is involved in visual spatial working memory... and that 
it modulates parietal lobe- and frontal lobe-mediated functions" (p. 52).

Fig. 9. AVITEWRITE architecture: cf = climbing fiber; = Gating Difference
Vector; DV_s = Size-scaled, memory-enhanced Difference Vector; DV_vis = Visual
Difference Vector; GO = Volitional speed control signal; GRO = Volitional size control
signal; mf = mossy fiber; PC = Purkinje cell; PPV = Present Position Vector; R =
Adaptively timed cerebellar output; TPV = Target Position Vector; TPV_m = Memory-modulated
Target Position Vector; WM = Spectral Working Memory Buffer output.

AVITEWRITE uses spatial attention to constrain the choice of the target positions 
that drive imitative tracing of a curve. The model assumes that these targets are se- 
lected within an attentional "tube" that is swept out by shifts in attention around the 
curve (Figure 10). If there is no memory, or if movement deviates from the attentional 
radius around the curve being traced due to memory inaccuracy, then a new target is 
chosen on the curve. Each choice of a new TPV from the current PPV defines a visual 
Difference Vector, or DV_vis, that is constrained to point forward along the template
curve and remain within an attentional radius of it, or else return the hand to
within that distance of the curve if it has exceeded it. More details about the target
selection algorithm are described below. The TPVs are used to form difference vectors,
DV_vis, that both drive the movement and act as teaching signals to train a cerebellar
spectral memory via climbing fiber inputs. 
Fig. 10. Illustration of target selection: (a) Targets are chosen so as to keep the
movement within an attentional radius, depicted as a circle around the current hand/pencil
tip position, of the curve being traced. Superposition of these circular foci of attention
as attention shifts across space generates an attentional "tube" around the template
curve, shown as dotted lines. (b) Target 1 is possible because movement to it would
not exceed the attentional radius, r_a, from the curve being traced, whereas Target 2 is
invalid because r_a would be exceeded.
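
The validity test of Figure 10(b) can be illustrated with a short sketch. It is only one possible reading of the figure, not the model's actual target selection algorithm: the template is assumed to be a list of sample points, `r_a` stands for the attentional radius, the hand is assumed to start inside the tube, and the names `dist_to_curve`, `choose_target`, and `n_checks` are hypothetical. A candidate target is accepted only if every sampled point on the straight path to it stays within `r_a` of the template.

```python
import math


def dist_to_curve(p, curve):
    """Distance from point p to the nearest sample of the template curve."""
    return min(math.dist(p, q) for q in curve)


def choose_target(ppv, curve, start_idx, r_a, n_checks=20):
    """Index of the farthest forward sample reachable without leaving the tube.

    Assumption: candidates are scanned forward from start_idx and the scan
    stops at the first candidate whose straight-line approach from ppv
    would exceed the attentional radius r_a (Target 2 in Figure 10b).
    Returns None if not even the next sample is reachable.
    """
    best = None
    for j in range(start_idx + 1, len(curve)):
        tx, ty = curve[j]
        inside = True
        for s in range(1, n_checks + 1):
            a = s / n_checks
            p = (ppv[0] + a * (tx - ppv[0]), ppv[1] + a * (ty - ppv[1]))
            if dist_to_curve(p, curve) > r_a:
                inside = False
                break
        if not inside:
            break
        best = j
    return best
```

On an L-shaped template, for example, the farthest valid target from the start of the horizontal stroke is the corner, since cutting the corner would carry the hand outside the tube.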

Once a target is chosen, vision provides direction and amplitude information, in the
form of the difference vector, DV_vis, to a trajectory generator which can combine
temporally overlapping muscle synergy activations to generate curved movements whose
speed and size are volitionally controlled. Evidence for the use of visual difference 
vectors has been reported by several investigators. For example, in a study of human 
visuomanual pointing to a visual target on a horizontal plane, Vindras and Viviani 
(1998) found that final hand position appeared to be "coded as a vector represented in 
an extrinsic frame of reference centered on the hand" (p. 569). Similarly, in their
study of cell population vectors in motor and premotor cortex during drawing
movements, Schwartz and Moran (1999) found that "population vectors predicted direction
(vector angle) and speed (vector length) throughout the drawing task" and that the "2/3
power law described for human drawing was also evident in the neural correlate of the
monkey hand trajectory" (p. 2705).

Forming a visual difference vector to a target on the template curve includes
activation of the appropriate muscle synergy to generate movement to that target. The
trajectory generator then starts to integrate the size-scaled, memory-enhanced difference
vector, DV_s, generating a velocity vector that drives movement to the target (Figure 9).
At the beginning of learning, when there is not yet a memory contribution to movement
control, DV_s equals DV_vis multiplied by a volitional size-scaling GRO factor. While
movement towards the visual target is occurring, adaptively timed learning of the
muscle synergy activations required to reach that target occurs. The cerebellum is
activated by active muscle synergies. The model assumes that different synergies
activate different spectral memories to learn and store corrective movement commands.
In the simulations (Figures 13), four separate spectral memories are formed for
positive and negative, horizontal and vertical movement synergies, respectively. The use
of separate synergy-activated spectral memories allows the model to learn a consistent 
movement despite the existence of variable errors on learning trials. It also allows 
muscle synergy-switching with independent control of each synergy. 

A new synergy is activated at the start of movement and whenever there is a
reversal in movement direction, requiring activation of a different synergistic set of
muscles. Prior to learning, the synergies needed to begin a movement are determined by
the value of DV_vis. For example, when starting the letter "U" with no prior
memory of this letter, a DV_vis is formed which initially points in the negative y and
positive x-directions. Purkinje cell spectra corresponding to the negative y and positive
x-direction synergies therefore begin sampling the climbing fiber error/teaching
signal. As memory starts to form, the model assumes that a visual representation of the
letter is categorized by inferotemporal and prefrontal mechanisms in the "what" corti- 
cal processing stream, and that a visual cue is used to sample the appropriate synergies 
used to perform a given letter from memory (Figure 11). Although not modeled
explicitly, AVITEWRITE assumes that a working memory, possibly in prefrontal cortex,
forms a category representation of each letter which controls adaptive pathways to all 
the synergies. The letter category determines which cerebellar spectra, corresponding 
to the particular synergies needed to write that letter, are activated via mossy fiber 
inputs. Only those adaptive pathways that were modified due to prior learning will 
read-out nonzero values of the cerebellar spectral memory output, R. In order to initi- 
ate writing of a learned letter, the letter category triggers the initial spectra that control 
the synergies needed to start the movement. When writing the letter "U" for example, 
the letter category memory activates spectra corresponding to the negative y and posi- 
tive x-direction synergies at the beginning of movement. The letter category repre- 
sentation also stores the identities of the other (the positive y) spectra involved in gen- 
erating that particular letter. Their order of activation is determined automatically by 
the synergy switching rule described below. 

Synergy switching is accomplished as follows in the model. If the total movement 
direction, determined by the sum of the reactive visual Difference Vector (DV_vis) and
the cerebellar spectral memory (R) in Figure 9, changes sign, then a new synergy and
Purkinje cell spectrum are activated. No new spectral components are activated in the 
spectrum from the prior synergy, although those components which are active at the 
time of the synergy switch continue to respond until they decay spontaneously. Such 
spectral behavior is supported by the responses of the biochemically-detailed Fiala et 
al. (1996) model to the sudden cessation of glutamate input to the Purkinje cells from 
the parallel fibers. In the Fiala et al. (1996) simulations, spectral components which 
are active at the time of input cessation remain active for a time while decaying
spontaneously, whereas no new spectral components respond once the glutamate input has
been shut off. The term spectral activity is here used to indicate the time-varying
change in Ca2+ concentration and potential of a Purkinje cell following parallel fiber
inputs. When writing a letter "U", a negative y-direction muscle synergy starts the 
movement. One Purkinje cell spectrum would learn to correct all the negative
y-synergy movement errors. At the bottom of the "U", the y-synergy would reverse,
triggering activation of a new spectrum to learn to correct the positive y-synergy
errors. At this point, input to the negative y-synergy spectrum would be stopped, e.g., by
shutting off the glutamate input released from parallel fibers in the Fiala et al. (1996)
model equations, and the spectra active at the time of the direction reversal would
decay.
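
The switching rule can be sketched for a single movement axis as follows. This is a minimal illustration under stated assumptions, not the model's implementation: the visual difference vector and the memory output R are taken to be sampled scalar time series for one axis, exact zeros are treated as "no direction", and the function name `synergy_switch_times` is hypothetical.

```python
def synergy_switch_times(dv_vis, r):
    """Sample indices at which the total movement direction changes sign.

    dv_vis and r are equal-length sequences for one axis (x or y).
    A switch is reported when the sign of dv_vis[i] + r[i] differs from
    the most recent nonzero sign (assumption: zeros carry no direction).
    """
    switches = []
    prev_sign = 0
    for i, (d, m) in enumerate(zip(dv_vis, r)):
        total = d + m
        sign = (total > 0) - (total < 0)
        if sign != 0 and prev_sign != 0 and sign != prev_sign:
            switches.append(i)  # activate the opposite synergy's spectrum here
        if sign != 0:
            prev_sign = sign
    return switches
```

For the y-axis of a "U", the total direction is negative during the downstroke and positive during the upstroke, so a single switch is reported at the bottom of the letter, whether the drive comes from the visual difference vector (before learning) or from R (after learning).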




Fig. 11. Working Memory (WM) representation of a letter category that determines
the sequential order of readout of synergy-specific spectra for the positive and
negative x and y synergies, x+, x-, y+, and y-. Synergy switching is triggered by a change
in sign of the total movement direction, DV_vis + R. mf = mossy fiber.

Error-driven movement learning is mediated by climbing fiber error signals, based 
on the Difference Vector TPV - PPV between the target position and the present hand
position. The climbing fiber signal modifies the parallel fiber/Purkinje cell synaptic 
efficacy by triggering patterns of Long Term Depression across the Purkinje cell 
populations that control the respective muscle synergies. As the Purkinje cells’ activity 
becomes more depressed, their target cerebellar nucleus becomes disinhibited (Figure 
7), thereby enhancing muscle synergy activation over time according to the temporal 
pattern of Purkinje cell population activity. 

The AVITEWRITE model incorporates competition between reactive movement 
and memory-based movement control systems. The model hypothesizes that the cere- 
bellar motor memory competes for control of movement with cortical areas that guide 
reactive movements based on visual input (Caminiti et al., 1999; Dagher et al., 1999; 
Jueptner et al., 1997a, 1997b; Jueptner & Weiller, 1998; Kawashima et al., 2000; 
Sadato et al., 1996). In the model, the reactive visual difference vector (DV_vis) and the
learned output from cerebellar memory (R), transiently stored in a working memory
buffer (WM) described below, are combined to form the Memory-Enhanced Difference
Vector, DV_m. The DV_m is, in turn, multiplied by a volitional size-scaling GRO
signal to yield the size-scaled, memory-enhanced Difference Vector, DV_s. When the
memory contribution to DV_s is strong enough, then the cerebellar memory determines
DV_s and DV_vis decays to zero. A visual difference vector (DV_vis) will be formed to a
target if either of two conditions is met:

First, if the memory is too small (below some threshold value), then the system 
waits for a brief period of time in case another memory is becoming active. If no 
memory grows beyond the threshold by the end of this time period, then a reactive 
visual DV_vis is formed in the manner described above. This drives the reactive
movement toward a target. Second, if an error is made due to a movement deviating
from the attentional radius around the template curve, then a corrective visual DV_vis is
formed which determines DV_s and drives a corrective movement. The difference
between the target and present positions (TPV - PPV) generates a cerebellar teaching
signal that updates the memory. Memory again takes over control once the trajectory
re-enters the attentional focus around the template curve, at which time DV_vis decays to
zero. Thus, on-line error correction occurs which automatically shuts off as the system
successfully learns to generate the desired curve. As learning proceeds, error-prone 
movements become successively more accurate until no errors are made and memory 
alone controls the movement. Once memory can control the movement without errors, 
the learned movement can be correctly executed without visual feedback. 

As in the original VITEWRITE model (Bullock et al., 1993), a volitional GO signal
scales movement speed in AVITEWRITE by altering the trajectory generator’s rate of
difference vector (DV_s) integration. However, the rate of predefined memory
"planning vector" readout in VITEWRITE was a function of the movement’s velocity. It is
still unclear how such a rule can hold across learning trials during which initial
variability in strokes and speeds eventually converges to a unimodal velocity profile.

When one turns to spectral learning to overcome this difficulty, one needs to face a 
different problem; namely, the rate with which cerebellar Purkinje cells can read out 
the synaptic weights that form their motor memory is limited. In other words, at- 
tempting to alter movement speed by changing the GO signal by a factor of 2.8 to 
match the range of human speeds (Wright, 1993) would not necessarily alter the rate 
at which the cerebellum reads out its stored motor commands by a comparable factor. 
AVITEWRITE hypothesizes that the rate at which the motor commands are retrieved 
from cerebellar long term memory defines the maximum possible rate at which error- 
free, memory-driven sequential handwriting movements can be made. 

How can learned movements be made across a wide range of speeds while keeping 
trajectory shape and velocity profiles relatively constant if the variability of the long 
term motor memory readout rate is limited? Van Galen (1991) suggested that working 
memory buffers between handwriting "processing modules" may "accommodate for 
time frictions between information processing activities in different modules" (p. 182). 
AVITEWRITE hypothesizes that a working memory system helps to write at a wide 
range of speeds even if the read-out rate of cerebellar spectra does not change. This 
working memory system, with movement speed-dependent motor command readout, 
is not to be confused with the prefrontal working memory assumed to store letter cate- 
gory representations discussed earlier but not explicitly modeled in AVITEWRITE. 
Experimental data support the idea that working memory function may influence 
movement speed. For example, several authors have found that lesions causing spatial 
working memory deficits also cause increased movement speed. Ventral hippocampal 
lesions (Bannerman et al., 1999), cholinergic basal forebrain lesions (Waite et al.,
1995), and NMDA receptor antagonism (Kretschmer & Fink, 1999) both impair spatial
working memory and cause an increase in movement speed. Pleskacheva et al.
(2000) found that voles with smaller hippocampal mossy fiber projections exhibited 
poorer spatial working memory and increased movement speed. Zhou et al. (1999) 
found that some neurons in the medial and lateral areas of the septal complex, which 
has close reciprocal connections with the hippocampus, display movement speed- 
related activity. Chieffi & Allport (1997) found support for the hypothesis that "short- 
term memory for a visually-presented location within reaching space" is represented in 
a "motoric code" (p. 244). 

The AVITEWRITE model hypothesizes that the learned cerebellar movement 
commands are transiently stored in a working memory buffer (WM in Figure 9) which
can read out those commands at a variable rate which is less than or equal to the rate
at which motor commands are retrieved from the cerebellar spectral memory. The
motor commands stored in the working memory are combined with the reactive visual
difference vector (DV_vis) and scaled by the volitional, size-controlling GRO signal to
form the memory-enhanced, size-scaled difference vector (DV_s) discussed above. A
memory-modulated movement target (TPV_m) is generated from the memory-enhanced
difference vector by adding DV_s to the current value of TPV_m. At the beginning of
movement, TPV_m is initialized to the starting position of the hand; that is, to the initial
value of the Present Position Vector (PPV). 

When an animal is making sequential movements to a series of targets, it must 
read out the next target from working memory as it reaches the current target in order 
to continue the sequence. In AVITEWRITE, a subsequent motor command is loaded 
from working memory and executed only when the previous memory-modulated
target (TPV_m) is reached. A memory-derived target has been reached when the present
hand position (PPV) equals the position of TPV_m. The difference vector from PPV to
TPV_m is defined as DV_g (Figure 9). Thus, when DV_g reaches zero or becomes
negative, TPV_m has been reached and the next command is loaded from the working
memory buffer (WM). (Alternatively, one could use a small, non-zero threshold value of
DV_g to trigger WM readout.) The working memory of AVITEWRITE allows the
volitionally controlled GO signal to alter movement speeds of both reactive and 
learned movements, while preserving trajectory shape and the shapes of the velocity 
profiles, by altering the rate of memory readout relative to the speed of the movement. 
The maximum speed at which a learned movement can be executed without error is 
determined by the rate of long term memory readout from the cerebellar spectral 
memory. In the model, removal of the cortical working memory buffer impairs the 
system’s ability to decrease the speed of learned movements while preserving their 
kinematic features, such as shape and velocity profile invariance. The model hereby 
offers one possible explanation for the experimentally observed movement speed in- 
creases following spatial working memory impairment. 
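
The gated readout described above can be sketched as a simple loop. This is a deliberately reduced illustration, not the model's trajectory generator: the hand is assumed to move at a constant speed equal to the GO value (the model instead produces bell-shaped velocity profiles), the stored commands are assumed to be already size-scaled displacement vectors, readout uses the small-threshold variant mentioned in the text, and the names `run_buffer`, `tol`, and `dt` are hypothetical.

```python
import math


def run_buffer(commands, go, start=(0.0, 0.0), dt=0.01, tol=1e-6,
               max_steps=1_000_000):
    """Execute buffered displacement commands; return (path, duration).

    The next command is loaded only when the gating difference vector
    (memory-modulated target minus present position) has shrunk to within
    tol, so readout rate follows movement speed.
    """
    ppv = list(start)        # present position
    tpv_m = list(start)      # memory-modulated target, initialised to PPV
    queue = list(commands)
    path = [tuple(ppv)]
    steps = 0
    while steps < max_steps:
        gx, gy = tpv_m[0] - ppv[0], tpv_m[1] - ppv[1]  # gating vector
        gap = math.hypot(gx, gy)
        if gap <= tol:
            if not queue:
                break                     # sequence finished
            dx, dy = queue.pop(0)         # load next command from the buffer
            tpv_m[0] += dx
            tpv_m[1] += dy
            continue
        step = min(go * dt, gap)          # assumption: constant speed = go
        ppv[0] += step * gx / gap
        ppv[1] += step * gy / gap
        path.append(tuple(ppv))
        steps += 1
    return path, steps * dt
```

Because each command waits for the previous target to be reached, doubling `go` halves the duration while the sequence of targets, and hence the shape traced, is unchanged.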
Fig. 12. (a) Purkinje cell depolarization spectrum from the Fiala et al. (1996)
equations. Continuous glutamate input = 5 µM. (b) Continuous glutamate input = 25
µM. Note that the spectrum is more dense and spans a shorter time than in (a).



One consequence of decreasing movement speed and the rate of motor command 
readout from the working memory buffer is that visual error feedback will be delayed. 
If the Purkinje cells responsible for triggering the erroneous movement have returned 
to their baseline activity by the time that the error feedback arrives via climbing fibers, 
then the parallel fiber/Purkinje cell synaptic weights will not be modified and the error 
will be repeated on the next learning trial. Such late error feedback may "correct" the 
wrong synaptic weights if other Purkinje cells in the population are active at the time 
that the climbing fiber signal arrives. A corrective movement could still be learned by 
modifying the weights of the Purkinje cells which are active when the error signal 
arrives, but it could be too late for it to significantly improve the movement trajectory. 
Further, it might even worsen performance if the curvature of the template curve near 
the current position of the moving hand has changed since the time the error occurred 
and the corrective movement points away from the curve at the time it is made. 
AVITEWRITE proposes the following solution to the problem of delayed error feed- 
back to the cerebellar Purkinje cell spectrum. This solution is consistent with the fact 
that increasing the conditioned stimulus intensity can "speed up the clock" in the rabbit 
nictitating membrane paradigm which earlier versions of spectral learning were used 
to model (Grossberg & Schmajuk, 1989, p. 93). In the model, the density of the 
Purkinje cell responses over time varies during learning as a function of the
volitionally controlled GO signal that controls movement speed. For learning at slow
movement speeds, the density of Purkinje cell responses over time is decreased. This
decreased density allows the activities of the Purkinje cells responsible for a given
component of a movement synergy command to span a greater period of time so that more
of them may be active at the time that the error feedback arrives. As speed increases, 
error feedback arrives sooner and Purkinje cell spectral density increases so that more 
cells are active sooner to sample the earlier error feedback. Simulations of the bio- 
chemically-predictive spectral timing model of Fiala et al. (1996) demonstrated that 
the rate of Purkinje cell response - that is, the spectral density - can be decreased by 
decreasing the amount of glutamate released at the parallel fiber/Purkinje cell synapse 
(Figure 12). By varying spectral density with speed in AVITEWRITE, successful 
learning may occur over a wider range of speeds. The mathematical equations and 
parameters that define the model are given in Grossberg and Paine (2000). 



8. Model Simulations 

Computer simulations illustrate the following model properties: (1) the model's
ability to learn to generate cursive letters with realistic velocity profiles; (2) generation of
an inverse relation between curvature and tangential velocity; (3) generation of a Two- 
Thirds Power Law relation between curvature and velocity; (4) the ability to vary the 
movement speed during learning, with a gradual increase in speed as learning pro- 
ceeds; (5) variable speed performance of learned movements with preservation of the 
movement shape and the shape of the velocity profile; (6) the ability to vary the size of 
movements while maintaining isochrony as well as the shape of the velocity profiles; 
and (7) the ability to yield coarticulatory context effects, such as variation of letter size 
and downstroke duration due to adjacent letters. 

Learning a Letter. Figures 13 and 14 illustrate the learning process as 
AVITEWRITE learns to write the cursive letter l by tracing a template curve for
thirty-seven trials. On early trials, mistakes are made as the newly forming memory 
competes for control of the movement with visually reactive movements to targets on 
the curve. Memory control is initially poor and requires corrective reactive movements 
which yield a segmented trajectory and a velocity profile that consists of several dis- 
crete peaks. As learning proceeds over multiple trials, performance gradually im- 
proves and the writing time decreases until, on trial thirty-seven in this case, the mem- 
ory representation of the synergy activations is able to drive an accurate, fast writing 
movement which does not deviate from the attentional radius around the template curve.
Fig. 13. Learning the letter l. Left: The attentional focus is illustrated by the tube
around the dashed template curve. Circles indicate the PPV when a new target,
marked by a square, is chosen, either because memory is too small or because the PPV
has exceeded the distance, r_a, from the template curve. Middle: AVITEWRITE’s l
viewed in isolation. Right: x (top) and y (bottom) velocity profiles, Vx, Vy. (a)
Learning trial 1; (b) Learning trial 12; (c) Final learning trial 37. The letter is now drawn
without deviating from the attentional radius around the template curve. Note also that
the writing time has decreased from over 25 to under 11 time units.
Fig. 14. Model components during learning of the letter l of Figure 3.12. Left: trial 1;
Right: trial 37; Top: Positive and negative x synergies; Bottom: Positive and negative y
synergies.

Figure 14 shows the dynamics of several model components during the learning 
process. The visual difference vector (DV_vis) from the present position (PPV) to a
target (TPV) is integrated through time as it competes with memory, R, to control the
movement. If R is less than a threshold value of ε or if movement exceeds a distance
r_a from the template curve, then a target, TPV, is chosen and DV_vis grows toward the
value of TPV - PPV. If R > ε and the PPV is within a distance r_a of the template curve,
then DV_vis decays toward zero. The Purkinje cell population response, R, which forms
the cerebellar memory output, is shaped by learning as the parallel fiber/Purkinje cell
synaptic weights are modified based on the error signal TPV - PPV. Note that on trial
37 (right side of figure), memory alone controls movement and keeps it within the
attentional radius of the template curve. No errors are made and DV_vis and TPV -
PPV equal zero throughout the learned movement. Figure 15 shows the corresponding
spectral activations during trial 37. Figure 16 shows a sample of how the model can 
learn all the letters of the alphabet. 
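
The competition between reactive and memory-based control described above can be sketched as a one-dimensional update rule. This is an assumption-laden illustration rather than the model's equations (which are given in Grossberg and Paine, 2000): a simple first-order relaxation stands in for the integration of the visual difference vector, the threshold test is applied to the magnitude of R, and the names `step_dv_vis`, `eps`, and `rate` are hypothetical.

```python
def step_dv_vis(dv_vis, tpv, ppv, r, within_radius, eps=0.1, rate=0.5):
    """One update of the visual difference vector along a single axis.

    Reactive control: if memory is weak (|r| < eps) or the hand has left
    the attentional radius, dv_vis relaxes toward TPV - PPV.
    Memory control: otherwise dv_vis relaxes toward zero.
    """
    if abs(r) < eps or not within_radius:
        target = tpv - ppv
    else:
        target = 0.0
    return dv_vis + rate * (target - dv_vis)
```

Iterating the rule with no memory drives the visual difference vector toward TPV - PPV; once R exceeds the threshold and the hand is inside the tube, the same rule lets it decay, leaving memory in control.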
Fig. 15. Purkinje cell spectra during trial 37 of learning the letter l. Top: Spectrum of
Purkinje cell responses (g) generated using Equation (2). Input to the spectrum of one
synergy is shut off when the net movement direction, given by DV_vis + R, changes sign.
A new synergy and Purkinje cell spectrum are then activated. Such synergy switching
occurs at approximately times t = 4 and 7 in the positive and negative x synergies (left:
gxp, gxn) and t = 6 and 9 in the positive and negative y synergies (right: gyp, gyn).
Bottom: The pattern of learned Purkinje cell activations (h) formed when g is gated by
the parallel fiber/Purkinje cell synaptic weights (z in Equation 3) formed during
learning.



Inverse Relation between Curvature and Velocity. Figure 17 compares three
letters learned by AVITEWRITE with similar letters written by adult human subjects
(Edelman & Flash, 1987). Unimodal x and y velocity profiles are generated for each 
synergy by both humans and AVITEWRITE, as is the inverse relation between tan- 
gential velocity and curvature. The peaks in curvature near the ends of the simulated 
trajectories are the result of the x and y velocities (Vx, Vy) getting very small, with 
Vx and Vy ≪ 1. The curvature:

$$C = \frac{V_x A_y - V_y A_x}{\left(V_x^2 + V_y^2\right)^{1.5}} \tag{1}$$

approaches infinity as the sum of Vx^2 and Vy^2 approaches zero. This effect is not seen
in the human data shown in Figure 17 because the curvature has been truncated prior
to the end of the velocity profile where velocity reaches zero. Terms Ax and Ay are the
x and y acceleration, respectively.
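
As a quick numerical check of Equation (1), the following sketch evaluates the curvature from velocity and acceleration components. The function name `curvature` and the convention of returning infinity at exactly zero speed are assumptions made for illustration.

```python
import math


def curvature(vx, vy, ax, ay):
    """Signed curvature from x/y velocity (vx, vy) and acceleration (ax, ay)."""
    speed_sq = vx * vx + vy * vy
    if speed_sq == 0.0:
        return math.inf  # assumption: report the limiting value at rest
    return (vx * ay - vy * ax) / speed_sq ** 1.5
```

For uniform motion around a circle of radius 2, the function returns 0.5 (the reciprocal of the radius), and for velocities much smaller than 1 with non-negligible acceleration the magnitude grows very large, which is the source of the curvature peaks near the ends of the simulated trajectories.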




Fig. 16. The alphabet as learned by AVITEWRITE. Each panel contains a letter at the
top with the x velocity profile in the middle and the y velocity profile at the bottom. 
All letters were learned at the relative scale shown here. The cross in the t, the letter x, 
and the dots on the i and j were omitted because they involved discontinuities in the 
movement, with lifting of the pen from the page and hand repositioning. 

The Two-Thirds Power Law. As curvature increases, the angular velocity required to 
move through the curve in a given amount of time also increases. Thus, angular ve- 
locity is a function of the curvature. This relation is quantified by the Two-Thirds 
Power Law, which states that the angular velocity is proportional to the curvature
raised to the two-thirds power (Lacquaniti et al., 1983):

$$A = k\,C^{2/3}, \tag{2}$$

where A = angular velocity, C = curvature, and k is a proportionality constant.
Equivalently,

$$V_{tan} = k\,r^{1/3}, \tag{3}$$




Fig. 17. Left: Human writing with x and y velocity profiles (Vx,Vy), movement cur- 
vature (C), and tangential velocity (Vtan) (Reproduced with permission from Edelman 
& Flash, 1987). Right: Similar shapes learned by AVITEWRITE. 
where V_tan = tangential velocity, r = radius of curvature (1/C), and k is a proportionality
constant. The law was originally reported to hold mainly for elliptical movements
(Lacquaniti et al., 1983). Since then, others (Wann et al., 1988, p. 635) have reported
that the law holds for handwriting movements at fast speeds. The law is violated when 
"size differences and translation are combined in a word" (Thomassen & Teulings, 
1985, p. 260). Nevertheless, the law holds under many conditions in human handwrit- 
ing movements. The Two-Thirds Power Law relation emerges from the learning proc- 
ess described in the current model (Figure 18). The Two-Thirds Power Law prediction 
of tangential velocity becomes unrealistically large as the curvature of the movement 
becomes very small (C ≪ 1), as may occur near the beginning and end of a movement
(Figure 17), causing the large spikes in the power law predictions in Figure 18. 
Smoothing the acceleration, as would be done by the motor plant, reduces the number 
of these spikes. 
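
The equivalence of the two forms of the law, and its exactness for elliptical movements, can be checked numerically. The sketch below is an illustration only: it assumes a harmonically traced ellipse x = a cos t, y = b sin t, takes angular velocity to be tangential velocity times curvature, and uses the hypothetical function name `ellipse_kinematics`. For this motion the proportionality constant works out to k = (ab)^(1/3).

```python
import math


def ellipse_kinematics(a, b, t):
    """Tangential velocity, curvature and angular velocity on an ellipse.

    The point moves as x = a*cos(t), y = b*sin(t) (harmonic motion).
    """
    vx, vy = -a * math.sin(t), b * math.cos(t)
    ax, ay = -a * math.cos(t), -b * math.sin(t)
    v_tan = math.hypot(vx, vy)
    c = (vx * ay - vy * ax) / v_tan ** 3   # curvature, as in Equation (1)
    ang = v_tan * c                        # angular velocity A = V_tan * C
    return v_tan, c, ang
```

With a = 3 and b = 1, the angular velocity equals k times the curvature raised to the 2/3 power at every sampled time, and the tangential velocity equals k times the radius of curvature raised to the 1/3 power, so both statements of the law describe the same relation.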






Fig. 18. Two-Thirds Power Law predictions (dotted lines) of tangential velocity com- 
pared to the actual tangential velocity (solid lines) of AVITEWRITE for the letters O, 
U, gamma, and l. For each letter, the top panel shows the power law prediction
calculated using the unfiltered model acceleration profile, and the bottom panel with
filtered outputs.

Variable Speeds During Learning. A task must usually be performed more slowly 
during the early stages of learning than at later stages. Increasing the speed of
performance before the motor system has adequately learned the task results in more
errors. Such a gradual speed increase occurs while learning to play a musical instrument
or to speak a new language. A similar phenomenon occurs during the learning of handwriting
movements (Alston & Taylor, 1987, p. 115; Burns, 1962, pp. 45-46; Freeman, 1914,
pp. 83-84). 




Fig. 19. Letters learned with variable speed compared to learning at a constant, fast
speed. In (a) and (c), the GO signal (J) and spectral density (calibrated by H) were
held constant (J = 20, H = 0.1). In (b) and (d), the GO signal and spectral density were
incrementally increased every two trials (starting at J = 19.25, H = 0.25; ending at J =
20, H = 0.1). The result was an increase in the range of movement durations. (a)
through (d): Left: Letter learned by AVITEWRITE; Middle: x and y velocity profiles,
Vx, Vy; Right: (top) trials versus movement duration (md); (middle) J over the course
of learning; (bottom) H over the course of learning.

Figure 19 shows that this gradual decrease of movement duration over multiple 
learning trials is a feature of AVITEWRITE's learning as well. The decrease in 
movement duration over the course of learning in AVITEWRITE may occur for two 
reasons: (1) In the early trials, the memory is not yet fully developed. As a result, the 
movement repeatedly deviates from the attentional radius around the template curve 
being traced, and the total distance moved may exceed the length of the template
curve (Figure 13a). As learning progresses, the movement remains within the
attentional radius more and more, so the total movement distance may decrease (Figures
13b and 13c). (2) Since fewer DV_vis signals contribute to forming the memory at earlier
trials (the memory forms a cumulative representation of all the DV_vis signals over all past
learning trials), the size of the memory signal R may be smaller at a given time for
earlier trials as compared to later trials. Movement velocity scales with the size of the 
cerebellar memory output, R. This increase in the size of the memory signal over the 
course of learning can also lead to a speed increase and a decrease in movement dura- 
tion as learning progresses. 

In addition to a decrease of movement duration resulting from the learning mecha- 
nism described above, a person may also voluntarily alter the speed of a movement. 
The model allows for such speed scaling during learning by varying the volitional GO 
signal along with the density of the cerebellar spectra which are sampling the move- 
ment error signals. Altering spectral density can also alter the size of the memory sig- 
nal, R, generated at a given time. Since the movement velocity is proportional to the 
size of R, the speed is altered both by changes in the GO signal and by changes in the 
spectral density. If the execution rate of movement commands stored in the working 
memory is reduced by decreasing movement speed via the GO signal, error feedback 
to the cerebellum is delayed. 







Fig. 20. Speed scaling of the letter l with preservation of the letter shape and the shape
of the x and y velocity profiles, Vx, Vy. Top: Letter l with the GO signal input J = 1.
Bottom: Letter l with the GO signal input J = 20.

Speed-Scaling of a Learned Movement. Previously learned movements can be writ- 
ten at a wide range of speeds with relatively little distortion of the shape of the move- 
ment or the velocity profiles. Wright (1993) has shown that the speed of handwriting 
movements can be varied by a factor of about 2.8 (a range of 0.6 to 1.66 times the 
baseline speed) without significantly altering the letter shape. Presumably, no new 
learning takes place during such speed-scaling since the letters have been written by
the subjects for years. 

The model yields speed-scaling by a comparable factor without shape or velocity 
profile distortion, as shown in Figure 20. These results are obtained through the use of 
a working memory buffer which transiently stores the outputs of the cerebellar long 
term memory and sends them on to the motor apparatus at a rate which can be de- 
creased relative to the rate of cerebellar readout (Figure 9). Speed is altered by varying 
the size of the GO signal. 

If learning has been completed at some final spectral density, altering spectral den- 
sity thereafter can result in distortions of the movement and its velocity profile. Thus, 
attempting to control the speed of learned movements by altering spectral density 
alone may trigger new movement errors. Instead, AVITEWRITE uses the volitional 
GO signal in conjunction with the working memory system to yield speed scaling with 
shape invariance. Since no new learning is required, and hence no delayed error feed- 
back, the spectral density is kept constant at the value reached on the last learning trial 
at which error-free movement was achieved. The model assumes that an attentional 
gate couples the GO signal and spectral density during attentive imitation, but that 
they are decoupled during automatic performance of a previously learned letter. 

Size Scaling and Isochrony. Size can be scaled in the model by varying the volitional 
GRO signal in Figure 9. Using the same GRO value for both horizontal and vertical 
directions will uniformly alter the size of a letter without altering the ratio of height to 
width (Figure 21). However, Wann and Nimmo-Smith (1990) have shown that hu- 
mans do alter this ratio when scaling letter sizes; that is, vertical and horizontal sizes 
can be scaled independently. In their experiment of size scaling, subjects were found 
to increase the horizontal (x) component of movement by 46% and the vertical (y) 
component by 78% (p. 111). Figure 22 shows the result of a simulation in which dif- 
ferent GRO values are used for the horizontal and vertical directions, with the x syner- 
gies’ GRO signal Sx increased 46% and Sy by 78%, relative to the value used during 
learning. 
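
The effect of independent GRO values can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the GRO signal is treated as a pure multiplicative gain on each velocity component, the velocity profiles are arbitrary, and the names draw and BASELINE_S are hypothetical. The baseline value 0.3 and the 46% and 78% increases are taken from the text and figure captions.

```python
import math

BASELINE_S = 0.3  # baseline GRO value used during learning (from the captions)

def draw(sx, sy, steps=1000, duration=1.0):
    """Integrate a stroke with separate x and y gains; return width, height, time."""
    dt = duration / steps
    x = y = 0.0
    xs, ys = [x], [y]
    for k in range(steps):
        t = k * dt
        vx = math.cos(2 * math.pi * t / duration)  # arbitrary x profile (assumption)
        vy = math.sin(2 * math.pi * t / duration)  # arbitrary y profile (assumption)
        x += sx * vx * dt            # horizontal gain Sx
        y += sy * vy * dt            # vertical gain Sy
        xs.append(x)
        ys.append(y)
    return max(xs) - min(xs), max(ys) - min(ys), duration

w0, h0, t0 = draw(BASELINE_S, BASELINE_S)
w1, h1, t1 = draw(BASELINE_S * 1.46, BASELINE_S * 1.78)

width_ratio = w1 / w0    # horizontal growth relative to baseline
height_ratio = h1 / h0   # vertical growth relative to baseline
```

In this sketch the width grows by 46% and the height by 78% while the movement time is unchanged, which is the isochronous, anisotropic scaling described above. The scaled gains, 0.438 and 0.534, round to the 0.44 and 0.53 quoted for Figure 22.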

Human handwriting exhibits isochrony; namely, the tendency for shapes of differ- 
ent sizes to be drawn in the same amount of time. Isochrony is also a feature of the 
model’s performance, as seen in Figures 21 and 22. Isochrony in humans is observed 
at small sizes, but it fails at large sizes; that is, the isochrony principle is valid within 
the "neighborhood of normal letter heights (approx. 0.5 cm) [but the] writing time will 
increase at some point where force demands become too high" (Thomassen & 
Teulings, 1985, p. 255). "Writing time is not invariant across changes in writing size, 
but increases by a small amount" (Wright, 1993, p. 49). These limits of isochrony may 
be due to the physical limitations of the hand/arm system and/or limits of the central 
force-control mechanisms of the brain, as exemplified in the extreme case of Parkin- 
son’s disease patients who appear to have a "reduced capability to maintain a given 
force level for the [prolonged] stroke time periods" required when letter size is greatly 
increased (Van Gemmert et al., 1999, p. 685). 

Coarticulatory Context Effects in Handwriting. How a cursive letter is written may 
be affected by adjacent, connected letters. Thomassen and Schomaker (1986) demon- 
strated context effects which they assume are due to coarticulation; that is, "anticipa- 
tory and overlapping instructions to the motor system" (p. 257). Different sets of mus- 
cles with separate goals can be working simultaneously, or the same set of muscles 
can be receiving motor commands to carry out separate goals. In the latter case, the 
muscles’ movements may be a summation or averaging of the commands they receive. 
If conflicting commands are received, some muscles in a group which usually work 
together toward a common goal may carry out one command while other muscles in 
the group carry out other commands (Ohman, 1965, pp. 166, 168; Fowler et al., 1993, 
p. 179). 




Fig. 21. Size scaling with isochrony. The dashed letter l is the template curve traced 
during learning with a baseline, size-scaling GRO signal S = 0.3. S = 0.15 for the 
smaller, solid l written by AVITEWRITE, and S = 0.6 for the larger, solid l. Both the 
large and the small l are written in the same amount of time, as seen in the x and y 
velocity profiles, Vx, Vy. 




Fig. 22. Independent scaling of horizontal and vertical components of size. The small, 
dashed letter l is the template curve traced during learning with a baseline, size-scaling 
GRO signal parameters Sx = Sy = 0.3. The two larger l’s both have a y GRO signal pa- 
rameter Sy = 0.53. The large, dash-dotted l has an x GRO signal of Sx = 0.44 corre- 
sponding to the dotted x velocity profile, Vx, while the large, solid l has Sx = 0.53 with 
a solid x velocity profile. 

Thomassen and Schomaker (1986) found that "more rapid writers... display 
stronger context effects than slower writers" (p. 257). This finding is consistent with 
the observed increase in speech carryover coarticulation with increases in speaking 
rate. "Carryover" ("perseverative", "left to right") coarticulation occurs when new 
motor commands are given before the previous commands have been fully executed. 
Muscles then begin contracting in a new pattern before the previous pattern of muscle 
contractions has been completed (Ostry et al., 1996). 




Fig. 23. Simulated combinations of the letters e and l. Left: The letters; Middle: x and 
y velocity profiles, Vx, Vy; Right: Tangential velocity, Vtan. 

The idea that some of the observed context effects in handwriting are due to car- 
ryover coarticulation was tested as follows. Connected letters were simulated with 
varying degrees of overlap of the corresponding spectral memories. In other words, 
the degree of superposition between adjacent letters was varied. The letters e and l 
were learned by the modeled system (Figure 23). The learned memory traces were 
then read out successively with varying degrees of overlap. Some of the downstroke 
duration and size effects observed by Thomassen and Schomaker (1986) were repli- 
cated by varying the degree of superposition between adjacent letters. In the simula- 
tion of the word eele, shown in Figure 24, the relative timing of the loading of the pre- 
viously learned letter memories was varied and the sizes of the letters were compared. 
The second e becomes smaller than the other e’s when its superposition increases with 
the large vertical upstroke of the following l, thereby canceling a large part of the e 
downstroke (Figures 24b and 24c). Increasing the time separation between letters can 
eliminate the coarticulatory size effects in the model, as seen in Figure 24a. 
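
The cancellation mechanism can be illustrated with a minimal Python sketch of superposed velocity commands. It is a toy, not the model's spectral memory readout: each letter's vertical velocity is assumed to be a single sine cycle (an upstroke followed by a downstroke), the letter duration of 7 time units and the amplitudes are hypothetical, and the function names are invented for illustration. The onset delays 6.6 and 5 echo the timing values listed for Figure 24.

```python
import math

LETTER_DURATION = 7.0   # hypothetical readout time per letter (time units)

def vy_letter(t, onset, amplitude):
    """Vertical velocity of one letter: a single sine cycle starting at `onset`."""
    phase = (t - onset) / LETTER_DURATION
    if 0.0 <= phase <= 1.0:
        return amplitude * math.sin(2 * math.pi * phase)
    return 0.0

def e_downstroke_size(l_onset, dt=0.001):
    """Vertical size of the e downstroke when a taller l starts at `l_onset`."""
    steps = int(round(LETTER_DURATION / dt))
    y = 0.0
    ys = [y]
    for k in range(steps):
        t = k * dt
        # Carryover coarticulation: the two velocity commands simply add.
        vy = vy_letter(t, 0.0, 1.0) + vy_letter(t, l_onset, 2.5)
        y += vy * dt
        ys.append(y)
    half = steps // 2                   # the e reaches its peak at mid-cycle
    return ys[half] - min(ys[half:])    # drop from the peak to the lowest point

# Onset 7.0 means no overlap; smaller onsets mean more superposition.
sizes = {onset: e_downstroke_size(onset) for onset in (7.0, 6.6, 5.0)}
```

In this sketch the e downstroke shrinks as the onset of the l moves earlier, because the positive upstroke velocity of the l cancels the tail of the negative e downstroke velocity. With no overlap the full downstroke is recovered.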



Fig. 24. Simulations of coarticulation. (a) through (c): Simulated eele with varying 
degrees of overlap between the letters. Timing relations are as follows: (a) 6.6, 6.6, 7 
(the second letter begins 6.6 time units after the first; the third starts 6.6 after the sec- 
ond, and the fourth starts 7 time units after the third, corresponding to the second Vx 
zero crossings shown in Vx Overlap); (b) 5, 5, 7; (c) 6.6, 5, 7. Vx, Vy Overlap show the 
overlapping velocity profiles of the individual letters. (d) Human writing of eele by 
two subjects (Figure (d) reproduced with permission from Thomassen & Schomaker, 
1986). The dotted y velocity profile, Vy, corresponds to the dotted eele. 

9 Conclusion 

The AVITEWRITE model clarifies aspects of how a person learns to make curved 
handwriting movements. This model incorporates elements of two previous groups of 
models: spectral timing models that analyze cerebellar learning (Fiala, Grossberg, & 
Bullock, 1996; Grossberg & Merrill, 1992; Grossberg & Schmajuk, 1989) and the 
VITE and VITEWRITE models of how the cerebral cortex and basal ganglia work 
together to control the read-out of trajectory commands (Bullock & Grossberg, 1988a, 
1988b, 1991; Bullock, Grossberg & Mannes, 1993). The AVITEWRITE model hereby 
seeks to clarify how the cerebral cortex, cerebellum, and basal ganglia may interact 
during complex learned movements. There is both cooperation and competition be- 
tween reactive vision-based imitation and planned memory readout. The cooperation 
includes interactions between cortical difference vectors and cerebellar learning. The 
competition arises between cerebellar control of learned movements and error-driven, 
cortical control of reactive movements to attentionally chosen visual targets. The 
model suggests that there is an automatic shift in the balance of movement control 
between these cortical and cerebellar processes during the course of learning. Reactive 
movements are made to attentionally chosen targets on a curve at the same time as 
movement error signals are generated which allow the cortico-cerebellar system to 
learn how to draw the curve. Memory-based movements gradually supersede visually- 
driven movements as learning progresses. Finally, the model shows how challenging 
psychophysical properties of planar hand movements may emerge from this cortico- 
cerebellar-basal ganglia interaction. 